Crawling dynamic web pages with Nutch
Posted on 2010-04-24 19:06 by 泰仔在線. Category: cloud computing
Solving the problem of searching dynamic content:
Two files under conf need attention: regex-urlfilter.txt and crawl-urlfilter.txt. Both contain this rule:
# skip URLs containing certain characters as probable queries, etc.
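-[?*!@=]    (change - to +)
This rule skips every URL that contains ?, *, !, @ or =. Because skipping is the default, dynamic pages, whose URLs usually contain ?, generally cannot be fetched with the default settings. In both of the files above, change the rule to: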
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
Then add one line that allows these URLs:
# accept URLs containing certain characters as probable queries, etc.
+[?=&]
This lets the crawler fetch URLs that contain any of the three characters ?, = and &.
Note: both files must be modified, because Nutch loads the rules in the order crawl-urlfilter.txt -> regex-urlfilter.txt.
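The effect of the change can be checked outside Nutch with a small sketch. It is not Nutch code: it is a Python approximation that assumes the filter reads rules top to bottom, lets the first rule whose regex is found anywhere in the URL decide, and drops a URL that no rule matches. The helper names `parse_rules` and `accepts`, the example.com URLs, and the trailing catch-all `+.` line are assumptions added for illustration.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """The first rule whose regex is found anywhere in the URL decides."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumption: a URL matched by no rule is dropped

DEFAULT_RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# assumed catch-all, not shown in the text above
+.
"""

MODIFIED_RULES = """
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
# accept URLs containing certain characters as probable queries, etc.
+[?=&]
# assumed catch-all, not shown in the text above
+.
"""

# Hypothetical URLs used only for this check.
dynamic_url = "http://example.com/list.jsp?page=2&id=7"
static_url = "http://example.com/index.html"

for name, text in (("default", DEFAULT_RULES), ("modified", MODIFIED_RULES)):
    rules = parse_rules(text)
    print(name, accepts(rules, dynamic_url), accepts(rules, static_url))
```

Under these assumptions the dynamic URL is rejected by the default rules and accepted by the modified ones, while the static URL passes in both cases. Because the first match decides, the position of the +[?=&] line relative to the other rules in the file matters.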
Reposted from: nutch抓取動態網頁 (Crawling dynamic web pages with Nutch)