
Reposted from: http://www.zdnet.com.cn/developer/webdevelop/story/0,2000081602,39154640,00.htm

Building an Index with Lucene

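Adding search to your Web site is one of the simplest ways to improve the browsing experience, but integrating a search engine into your application is not always easy. To help you give your Java application a flexible search engine, I will explain how to use Lucene, an extremely flexible open source search engine.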

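Lucene integrates directly with your Web application. It is written in Java and developed under the Apache Jakarta project. Your Java application can use Lucene as the core of any search feature. Lucene can handle any kind of text data, but it has no built-in support for Word, Excel, PDF, or XML. Solutions exist that let Lucene work with each of these formats.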

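An important point about Lucene is that it is only a search engine, so it has no built-in Web user interface and no Web crawler. To integrate Lucene into your Web application, you write a servlet or JSP page that displays the query form and another page that lists the results.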

Building an Index with Lucene

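Lucene indexes the text content of your application and stores it on the file system as a set of index files. Lucene accepts Document objects, each representing a single piece of content such as a Web page or a PDF file. Your application is responsible for converting its content into Document objects that Lucene can understand.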

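Each document consists of one or more Field objects. A field has a name and a value, much like an entry in a hash map. Each field should correspond to a piece of information that you need to search on or to display in the search results. A title, for example, should appear in the search results, so it is added to the Document object as a field. Fields can be indexed or not, and the original data can optionally be stored in the index. Stored fields are useful when you build the results page. Fields that are of no use for searching, such as a unique ID, do not need to be indexed; they only need to be stored.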
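To make the stored, indexed, and tokenized flags concrete, here is a minimal sketch in plain Java. It does not use Lucene: ToyField and its three factory methods are hypothetical names invented for this illustration, and they only model the combinations of flags described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a document field. Illustration only; not part of Lucene. */
final class ToyField {
    final String name;
    final String value;
    final boolean stored;     // original value is kept so it can be shown in results
    final boolean indexed;    // value can be searched
    final boolean tokenized;  // value is split into tokens before indexing

    private ToyField(String name, String value,
                     boolean stored, boolean indexed, boolean tokenized) {
        this.name = name;
        this.value = value;
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }

    /** Stored only: for data such as a unique ID that is displayed but never searched. */
    static ToyField storedOnly(String name, String value) {
        return new ToyField(name, value, true, false, false);
    }

    /** Stored and indexed as one exact term, without tokenizing. */
    static ToyField exactTerm(String name, String value) {
        return new ToyField(name, value, true, true, false);
    }

    /** Stored, indexed, and tokenized: for free text that users search. */
    static ToyField fullText(String name, String value) {
        return new ToyField(name, value, true, true, true);
    }

    public static void main(String[] args) {
        // A "document" is simply a list of fields in this sketch.
        List<ToyField> document = new ArrayList<>();
        document.add(ToyField.storedOnly("productId", "1E344"));
        document.add(ToyField.exactTerm("name", "Blizzard Convertible"));
        document.add(ToyField.fullText("content", "The Blizzard is the finest convertible"));

        for (ToyField f : document) {
            System.out.println(f.name + ": stored=" + f.stored
                    + ", indexed=" + f.indexed + ", tokenized=" + f.tokenized);
        }
    }
}
```

The point of the split is that a results page needs only the stored values, while the search itself needs only the indexed ones.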

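Fields can also be tokenized, which means an analyzer breaks the field's content into tokens that the search engine can use. Lucene ships with several analyzers, but I will use only the most powerful one, the StandardAnalyzer class.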

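The StandardAnalyzer class converts all text to lowercase and removes common stop words. Stop words are words such as "a", "the", and "in" that occur very often in content but are of no use for searching. The analyzer also processes search queries, so a query is reduced to the same form as the indexed content and can match it. For example, the text "The dog is a golden retriever" is indexed as "dog golden retriever". When a user searches for "a Golden Dog", the analyzer turns the query into "golden dog", which matches our content.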
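The lowercase-and-stop-word behavior can be sketched in a few lines of plain Java. This is not Lucene's StandardAnalyzer: ToyAnalyzer is a hypothetical class, and its stop-word list is a small assumed one chosen only to reproduce the dog example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analyzer: lower-cases text, splits it into words, drops stop words. */
final class ToyAnalyzer {
    // Assumed stop-word list for this sketch; a real analyzer ships its own.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "in", "is", "of", "and"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on any run of characters that are not ASCII letters or digits.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> indexed = analyze("The dog is a golden retriever");
        List<String> query = analyze("a Golden Dog");
        System.out.println(indexed);                    // [dog, golden, retriever]
        System.out.println(query);                      // [golden, dog]
        // Every query token appears in the indexed tokens, so the query matches.
        System.out.println(indexed.containsAll(query)); // true
    }
}
```

Because the same function runs over both the content and the query, differences in case and filler words never prevent a match.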

Our example uses a Data Access Object (DAO) and a business object; the DAO is a common pattern in Java application development. The DAO I will use, ProductDAO, is shown in Listing A.

Listing A

package com.greenninja.lucene;

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

public class ProductDAO {

    // products keyed by their unique id
    private Map<String, Product> map = new HashMap<String, Product>();

    /**
     * Initializes the map with new Products
     */
    public void init() {

        Product product1 = new Product("1E344", "Blizzard Convertible",
            "The Blizzard is the finest convertible on the market today, with 120 horsepower, 6 seats, and a steering wheel.",
            "The Blizzard convertible model is a revolutionary vehicle that looks like a minivan, but has a folding roof like a roadster. We took all of the power from our diesel engines and put it into our all new fuel cell power system.");
        map.put(product1.getId(), product1);

        Product product2 = new Product("R5TS7", "Truck 3000",
            "Our Truck 3000 model comes in all shapes and sizes, including dump truck, garbage truck, and pickup truck. The garbage truck has a full 3 year warranty.",
            "The Truck 3000 is built on the same base as our bulldozers and can be outfitted with an optional hovercraft attachment for all-terrain travel.");
        map.put(product2.getId(), product2);

        Product product3 = new Product("VC456", "i954d-b Motorcycle",
            "The motorcycle comes with a sidecar on each side, for additional stability and cornering ability.",
            "Our motorcycle has the same warranty as our other products and is guaranteed for many miles of fun biking. Each motorcycle is shipped with a nylon windbreaker, goggles, and a helmet with a neat visor.");
        map.put(product3.getId(), product3);
    }

    /**
     * Gets a collection of all of the products
     *
     * @return all of the products
     */
    public Collection<Product> getAllProducts() {
        return map.values();
    }

    /**
     * Gets a product, given the unique id
     *
     * @param id the unique id
     * @return the Product object, or null if the id wasn't found
     */
    public Product getProduct(String id) {
        // Map.get returns null when the id wasn't found
        return map.get(id);
    }
}

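To keep the demo simple, I won't use a database; the DAO just holds a collection of Product objects. In this example, I take the Product objects shown in Listing B and convert them into documents for indexing.

Listing B

package com.greenninja.lucene;

public class Product
{
    private String name;
    private String shortDescription;
    private String longDescription;
    private String id;

    /**
     * Constructor to create a new product
     */
    public Product(String i, String n, String sd, String ld)
    {
        this.id = i;
        this.name = n;
        this.shortDescription = sd;
        this.longDescription = ld;
    }

    // getters and setters
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public String getShortDescription() { return shortDescription; }
    public void setShortDescription(String sd) { this.shortDescription = sd; }

    public String getLongDescription() { return longDescription; }
    public void setLongDescription(String ld) { this.longDescription = ld; }
}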

The Indexer class is shown in Listing C. It is responsible for converting each Product into a Lucene document and for creating the Lucene index.

Listing C

package com.greenninja.lucene;

import java.io.IOException;
import java.util.Collection;
import java.util.Iterator;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class Indexer
{
    protected IndexWriter writer = null;

    protected Analyzer analyzer = new StandardAnalyzer();

    public void init(String indexPath) throws IOException
    {
        // create a new index every time this is run
        writer = new IndexWriter(indexPath, analyzer, true);
    }

    public void buildIndex() throws IOException
    {
        // get the products from the DAO
        ProductDAO dao = new ProductDAO();
        dao.init();
        Collection products = dao.getAllProducts();

        Iterator iter = products.iterator();

        while (iter.hasNext())
        {
            Product product = (Product) iter.next();

            // convert the product to a document.
            Document doc = new Document();

            // create an unindexed, untokenized, stored field for the product id
            doc.add(Field.UnIndexed("productId", product.getId()));

            // create an indexed, untokenized, stored field for the name
            doc.add(Field.Keyword("name", product.getName()));

            // create an indexed, untokenized, stored field for the short description
            doc.add(Field.Keyword("short", product.getShortDescription()));

            // create an indexed, tokenized, stored field for all of the content
            String content = product.getName() + " " + product.getShortDescription() +
                " " + product.getLongDescription();
            doc.add(Field.Text("content", content));

            // add the document to the index
            try
            {
                writer.addDocument(doc);
                System.out.println("Document " + product.getName() + " added to index.");
            }
            catch (IOException e)
            {
                System.out.println("Error adding document: " + e.getMessage());
            }
        }

        // optimize the index
        writer.optimize();

        // close the index
        writer.close();
    }
}

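The Product class has four fields: ID, name, short description, and long description. The ID is stored as an unindexed, untokenized field using the UnIndexed method of the Field class. The name and short description are stored as indexed, untokenized fields using the Keyword method of the Field class. The search engine queries the content field, which contains the product's name, short description, and long description.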

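After all the documents have been added, optimize the index and close the index writer so that the index can be used. Most Lucene deployments use incremental indexing, in which documents already in the index are updated individually instead of the whole index being deleted and rebuilt each time.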
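The idea behind incremental indexing can be illustrated with a tiny in-memory inverted index in plain Java. This is a sketch of the concept only, not how Lucene implements it: ToyIncrementalIndex and its methods are hypothetical, and the "Truck 4000" text is made-up sample data.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy in-memory inverted index that updates one document at a time. */
final class ToyIncrementalIndex {
    // term -> IDs of the documents that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();
    // document ID -> terms currently indexed for it, so an update can undo them
    private final Map<String, Set<String>> termsByDoc = new HashMap<>();

    /** Adds a document, or replaces it if the ID is already indexed. */
    void addOrUpdate(String docId, String text) {
        remove(docId);
        Set<String> terms = new HashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                terms.add(word);
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
        termsByDoc.put(docId, terms);
    }

    /** Removes every posting that belongs to the given document. */
    void remove(String docId) {
        Set<String> oldTerms = termsByDoc.remove(docId);
        if (oldTerms == null) {
            return; // the document was never indexed
        }
        for (String term : oldTerms) {
            Set<String> ids = postings.get(term);
            ids.remove(docId);
            if (ids.isEmpty()) {
                postings.remove(term);
            }
        }
    }

    /** Returns the IDs of the documents that contain the term (empty if none). */
    Set<String> search(String term) {
        Set<String> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new TreeSet<>() : new TreeSet<>(ids);
    }

    public static void main(String[] args) {
        ToyIncrementalIndex index = new ToyIncrementalIndex();
        index.addOrUpdate("R5TS7", "Truck 3000");
        index.addOrUpdate("VC456", "i954d-b Motorcycle");
        // Only the changed product is re-indexed; the other entry is untouched.
        index.addOrUpdate("R5TS7", "Truck 4000 with hovercraft attachment");
        System.out.println(index.search("3000"));       // []
        System.out.println(index.search("hovercraft")); // [R5TS7]
        System.out.println(index.search("motorcycle")); // [VC456]
    }
}
```

Tracking the terms per document is what makes the update cheap: replacing one product touches only that product's postings.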

Running a Query

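Creating a query and searching the index is simpler than building the index. Your application asks the user for a search query, which can be a single word. Lucene also has more advanced Query classes for Boolean searches and phrase searches.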

An example of an advanced query is "Mutual Fund" AND stock*, which finds documents that contain the phrase Mutual Fund and a word that starts with stock (such as stocks, stock, or even stockings).
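What that query asks for can be spelled out in plain Java. The sketch below is not Lucene's query parser or its Query classes: ToyQueryMatcher is a hypothetical helper that checks one document for a phrase and a prefix, and the sample sentences are invented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Toy matcher for a query of the form "some phrase" AND prefix*. */
final class ToyQueryMatcher {

    /** Lower-cases the text and splits it into words. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    /** True if the words of the phrase appear next to each other, in order. */
    static boolean containsPhrase(List<String> tokens, String phrase) {
        List<String> words = tokenize(phrase);
        return !words.isEmpty() && Collections.indexOfSubList(tokens, words) >= 0;
    }

    /** True if any token starts with the prefix. */
    static boolean containsPrefix(List<String> tokens, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String token : tokens) {
            if (token.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    /** Both conditions must hold, which is the meaning of AND. */
    static boolean matches(String document, String phrase, String prefix) {
        List<String> tokens = tokenize(document);
        return containsPhrase(tokens, phrase) && containsPrefix(tokens, prefix);
    }

    public static void main(String[] args) {
        // Phrase present and "stocks" starts with "stock".
        System.out.println(matches("Our mutual fund holds stocks and bonds", "Mutual Fund", "stock"));  // true
        // Both words occur, but not as an adjacent phrase.
        System.out.println(matches("The fund is mutual; it sells stockings", "Mutual Fund", "stock")); // false
        // Phrase present, but no word starts with "stock".
        System.out.println(matches("A mutual fund that holds only bonds", "Mutual Fund", "stock"));    // false
    }
}
```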

More information about queries in Lucene
The query syntax page on the Lucene Web site provides more detail.

The Searcher class is shown in Listing D. It looks up your search terms in the Lucene index. For this demo I use a simple query that is just a string, without any advanced query features. I create a Query object from the query string with the QueryParser class, which uses the StandardAnalyzer class to break the query string into tokens, remove stop words, and convert the text to lowercase.

Listing D

package com.greenninja.lucene;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class Searcher
{
    protected Analyzer analyzer = new StandardAnalyzer();

    public Hits search(String indexPath, String queryString) throws IOException, ParseException
    {
        // the Lucene index Searcher class, which uses the query on the index
        IndexSearcher indexSearcher = new IndexSearcher(indexPath);

        // make the query with our content field, the query string, and the analyzer
        Query query = QueryParser.parse(queryString, "content", analyzer);

        Hits hits = indexSearcher.search(query);

        return hits;
    }
}

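The query is passed to an IndexSearcher object, which is initialized with the file system location of the index. The search method of IndexSearcher accepts the query and returns a Hits object. The Hits object holds the search results as Lucene Document objects, along with the number of results. The doc method of the Hits object retrieves each document in the results.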

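Each Document object contains the fields I added to it in the Indexer. Some of these fields were stored but not tokenized, and you can extract them from the document. The sample application runs a query against the search engine and then displays the names of the products it found.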

Running the Demo

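To run the sample program in this article, download the latest Lucene binary distribution from the Lucene Web site. The lucene-1.3-rc1.jar file from the distribution must be on your Java class path to run the demo. The demo creates an index directory named index in the directory from which you run the com.greenninja.lucene.Demo class. You also need an installed JDK. A typical command line is: java -cp c:\java\lucene-1.3-rc1\lucene-1.3-rc1.jar;. com.greenninja.lucene.Demo (see Figure A). The sample data for this example is contained in the ProductDAO class. The query is part of the Demo class.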

Figure A: Command-line example

Resources
· javaworld.com
· Matrix Java developer community: http://www.matrix.org.cn/
· Lucene search engine library: http://jakarta.apache.org/lucene/docs/index.html
· MAOS open source project: http://sourceforge.net/projects/maos/

Posted on 2006-05-15 22:54 by 地獄男爵 (hellboys). Category: Programming languages (c/c++ java python sql ......)
