http://www.aygfsteel.com/xiaodaoxiaodao/articles/103441.html

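Please credit the source and author when reposting.

A Light Introduction to Using nutch 0.8 on Windows

Nutch is a search engine that I only heard about yesterday from a friend. I had been working with lucene a little while ago and was itching to try more search tools, so I spent the weekend trying it out, and it felt quite fresh. Most examples online are based on version 0.7, and the few I found for 0.8 would not run. After half a day of fumbling around, here are some notes ~~

Environment: Tomcat 5.0.12 / JDK 1.5 / nutch 0.8.1 / cygwin-cd-release-20060906.iso

Steps: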

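1. The nutch launcher (bin/nutch) is a shell script and needs a unix environment, so Windows users first have to download cygwin, free software that emulates a unix environment on Windows. You can get the online installer from http://www.cygwin.com/ , or the complete installer from http://www-inst.eecs.berkeley.edu/~instcd/iso/ (the one I downloaded was 1.27 GB, so make sure you have enough disk space ~~). During installation, just click "next" all the way through.

2. Download nutch 0.8.1 from http://apache.justdn.org/lucene/nutch/ . I extracted it to D:\nutch-0.8.1

3. Create a new folder named urls inside nutch-0.8.1, and in urls create a text file (any file name will do) containing a single line: http://lucene.apache.org/nutch . This is the site to crawl and search.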
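Step 3 can also be done from the cygwin prompt. A minimal sketch, assuming the current directory is nutch-0.8.1; the file name seed.txt is an arbitrary choice for illustration, since any name works:

```shell
# Create the urls folder (no error if it already exists).
mkdir -p urls

# Write the single seed URL; "seed.txt" is a hypothetical file name.
echo "http://lucene.apache.org/nutch" > urls/seed.txt

# Show what the crawler will read.
cat urls/seed.txt
```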

4. Open the conf folder under nutch-0.8.1, open crawl-urlfilter.txt, and find these two lines:

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

The second line is a regular expression, and the URLs you want to crawl must match it. Here I changed it to +^http://([a-z0-9]*\.)*apache.org/
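Before crawling, it can help to confirm that the seed URL really matches the edited rule. The sketch below uses grep -E as a stand-in for nutch's own regex matching, which is an assumption (nutch uses Java regular expressions, though this simple pattern should behave the same in both); check_url is a hypothetical helper, not part of nutch:

```shell
# The accept rule from crawl-urlfilter.txt, without the leading "+".
pattern='^http://([a-z0-9]*\.)*apache.org/'

# check_url is a hypothetical helper: prints whether a URL matches the rule.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "accept: $1"
  else
    echo "no match: $1"
  fi
}

check_url "http://lucene.apache.org/nutch"
check_url "http://www.example.com/"
```

The first call prints "accept: ..." and the second "no match: ...". Note that the unescaped dot in apache.org matches any character; writing apache\.org makes the rule stricter.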

5. OK, now build the index for the site. Start cygwin, which opens a command window, and enter "cd /cygdrive/d/nutch-0.8.1" to switch to the nutch-0.8.1 directory.

6. Run "bin/nutch crawl urls -dir crawled -depth 2 -threads 5 >& crawl.log"

The parameters mean the following (from the apache site, http://lucene.apache.org/nutch/tutorial8.html ):

-dir dir names the directory to put the crawl in.

-threads threads determines the number of threads that will fetch in parallel.

-depth depth indicates the link depth from the root page that should be crawled.

-topN N determines the maximum number of pages that will be retrieved at each level up to the depth.

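crawl.log: the log file; ">& crawl.log" redirects both the normal output and the error output of the command into it.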
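A sketch of the step 6 command with each option spelled out as a variable, which makes it easier to vary depth or threads between runs. Note that -dir and -depth are separate options. The variable names are made up for this example, and the crawl only starts when bin/nutch is present in the current directory:

```shell
# Crawl settings; the values mirror the command in step 6.
CRAWL_DIR=crawled   # value for -dir
DEPTH=2             # value for -depth
THREADS=5           # value for -threads

CMD="bin/nutch crawl urls -dir $CRAWL_DIR -depth $DEPTH -threads $THREADS"
echo "$CMD"

# Launch the crawl only when run from a directory that contains bin/nutch.
# "> crawl.log 2>&1" sends stdout and stderr to crawl.log, like ">& crawl.log" in bash.
if [ -x bin/nutch ]; then
  $CMD > crawl.log 2>&1
fi
```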

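After it finishes, a new crawled folder appears under nutch-0.8.1, containing 5 folders:

① / ② crawldb / linkdb: the web link directories, which store the urls and the link relationships between them and serve as the basis for crawling and re-crawling. Pages expire after 30 days by default (configurable in nutch-site.xml, as mentioned later).

③ segments: stores the fetched pages and depends on the link depth set by depth above. With depth set to 2, two subfolders named by timestamp are created under segments, for example "20061014163012". Opening one of them shows 6 more subfolders (from the apache site, http://lucene.apache.org/nutch/tutorial8.html ):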

crawl_generate: names a set of urls to be fetched

crawl_fetch: contains the status of fetching each url

content: contains the content of each url

parse_text: contains the parsed text of each url

parse_data: contains outlinks and metadata parsed from each url

crawl_parse: contains the outlink urls, used to update the crawldb

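④ indexes: the index directory. My run produced a folder named "part-00000" in it.

⑤ index: the lucene index directory (nutch is built on lucene; you can see lucene-core-1.9.1.jar under nutch-0.8.1\lib, and a short guide to the luke tool appears at the end). It is the complete index produced by merging all the indexes under indexes. Note that the index only indexes page content and does not store it, so a query has to read the segments directory to get the page content.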
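A quick way to confirm that a crawl produced all five folders listed above. This is a sketch only; check_crawl_dir is a hypothetical helper, and "crawled" is assumed to be the -dir value from step 6:

```shell
# check_crawl_dir is a hypothetical helper: reports which of the five
# expected folders exist under the given crawl directory.
check_crawl_dir() {
  missing=0
  for sub in crawldb linkdb segments indexes index; do
    if [ -d "$1/$sub" ]; then
      echo "ok: $sub"
    else
      echo "missing: $sub"
      missing=1
    fi
  done
  return $missing
}

# "crawled" is the -dir value used in step 6.
check_crawl_dir crawled || echo "crawl output is incomplete"
```

If any folder is reported missing, crawl.log is the first place to look.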

7. Run a simple test: in cygwin enter "bin/nutch org.apache.nutch.searcher.NutchBean apache", which calls the main method of NutchBean to search for the keyword "apache". cygwin then shows: Total hits: 29 (hits are comparable to the results of a JDBC query).

Note: if the search always returns 0 results, configure nutch-site.xml under nutch-0.8.1\conf with the same content as in step 9 (a depth of 1 in step 6 can also lead to 0 results), then run step 6 again.
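The check in the note can be scripted. The sketch below assumes the search output contains a line that starts with "Total hits: " followed by a number, as quoted in step 7; parse_hits is a hypothetical helper, and the real search only runs where bin/nutch exists:

```shell
# parse_hits is a hypothetical helper: reads search output on stdin and
# prints the number from the first "Total hits: N" line.
parse_hits() {
  sed -n 's/^Total hits: \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Demonstration on the line quoted in step 7.
echo "Total hits: 29" | parse_hits

# Against a real crawl (only runs where bin/nutch exists).
if [ -x bin/nutch ]; then
  hits=$(bin/nutch org.apache.nutch.searcher.NutchBean apache | parse_hits)
  if [ "$hits" = "0" ]; then
    echo "0 hits: check http.agent.name and searcher.dir in conf/nutch-site.xml, then re-crawl"
  fi
fi
```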

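8. Next we test under Tomcat. nutch-0.8.1 contains nutch-0.8.1.war; copy it to Tomcat\webapps. You can extract it directly into this directory with winrar; I let Tomcat unpack it on startup. The extracted folder is named: nutch

9. Open nutch-site.xml under nutch\WEB-INF\classes. The two <property> elements below are what needs to be added; the rest is the original nutch-site.xml content.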

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
  <name>http.agent.name</name>
  <value>*</value>
  <description></description>
</property>

<property>
  <name>searcher.dir</name>
  <value>D:\nutch-0.8.1\crawled</value>
  <description></description>
</property>

</configuration>

http.agent.name: required. Without this property the query result is always 0.

searcher.dir: the path of the crawled directory generated earlier in cygwin.

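The re-crawl interval mentioned in step 6 (pages expire after 30 days by default) can be set here as well. In nutch-default.xml the property described as the number of days between re-fetches is db.default.fetch.interval; fetcher.max.crawl.delay is a different setting that concerns the Crawl-Delay from robots.txt, so check the descriptions in nutch-default.xml before changing either.

<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description></description>
</property>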

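Many more parameters can be looked up in nutch-default.xml under nutch-0.8.1\conf. Every property in nutch-default.xml comes with a description, so if you are interested you can copy them one at a time into Tomcat\webapps\nutch\WEB-INF\classes\nutch-site.xml and experiment.

10. Open http://localhost:8081/nutch and enter "apache". The page reports 29 results in total, matching the simple test in step 7.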

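About Luke:

Luke is a graphical tool for inspecting lucene index files. It gives a fairly direct view of how the index was built, and it has to be used together with the lucene jar.

Steps:

1. Download it from http://www.getopt.org/luke , which offers 3 kinds of download: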

standalone full JAR: lukeall.jar

standalone minimal JAR: lukemin.jar

separate JARs:
    luke.jar (~113kB)
    lucene-1.9-rc1-dev.jar (~380kB)
    analyzers-dev.jar (~348kB)
    snowball-1.1-dev.jar (~88kB)
    js.jar (~492kB)

We only need luke.jar from the "separate JARs" download.

2. After downloading, create a new folder, say "luke_run", put luke.jar in it, and copy lucene-core-1.9.1.jar from nutch-0.8.1\lib into the same folder.

3. In a cmd prompt, change to the "luke_run" directory and enter "java -classpath luke.jar;lucene-core-1.9.1.jar org.getopt.luke.Luke". The luke GUI opens. Use "File" ==> "Open Lucene index" to open the "nutch-0.8.1\crawled\index" folder (created in step 6 above), and luke then shows the details of the index.
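The command in step 3 is written for cmd. If you start luke from the cygwin window instead, the classpath has to be quoted, because bash treats an unquoted ";" as the end of the command. A small sketch, assuming both jars are in the current folder (the CP variable name is just for illustration):

```shell
# Java on Windows still expects ";" between classpath entries,
# so keep the separator and quote the whole value for bash.
CP="luke.jar;lucene-core-1.9.1.jar"

if [ -f luke.jar ] && [ -f lucene-core-1.9.1.jar ]; then
  java -classpath "$CP" org.getopt.luke.Luke
else
  echo "copy luke.jar and lucene-core-1.9.1.jar into this folder first"
fi
```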

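4. One more aside :) I ran into a problem while using it: clicking the "Reconstruct&Edit" button under "Documents" always throws an Exception. The error says that the Field constructor luke calls does not exist in lucene-core-1.9.1.jar, so luke presumably would not throw this Exception with a lucene jar that has that constructor: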

Exception in thread "Thread-12" java.lang.NoSuchMethodError: org.apache.lucene.document.Field.<init>(Ljava/lang/String;Ljava/lang/String;ZZZZ)V

        at org.getopt.luke.Luke$2.run(Unknown Source)

Posted 2006-10-20 20:22 by 藍小刀. Category: JAVA
