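Suppose a directory on our computer contains many text documents, and we need to find which of them contain a certain keyword. To do this, we first use Lucene to build an index over the documents in that directory, and then search that index for the documents we want. This example should give readers a reasonably clear picture of how to build their own search application with Lucene.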

Building the Index

To index documents, Lucene provides five basic classes: Document, Field, IndexWriter, Analyzer, and Directory. The purpose of each is described below.

Document

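A Document describes a document, which here may be an HTML page, an email, or a text file. A Document object is made up of multiple Field objects. You can think of a Document as a record in a database, with each Field being one column of that record.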

Field

A Field object describes one attribute of a document. For example, the title and the body of an email can be described by two separate Field objects.
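
The record-and-column analogy can be sketched in plain Java. This is only a toy model under stated assumptions: ToyDocument and its add and get methods are hypothetical names invented for illustration, not Lucene's Document or Field classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy stand-in for the Document/Field relationship; NOT Lucene's classes.
class ToyDocument {
    // Field name -> field value, kept in insertion order like columns in a record.
    private final Map<String, String> fields = new LinkedHashMap<>();

    // Adds one named attribute and returns this document so calls can be chained.
    ToyDocument add(String name, String value) {
        fields.put(name, value);
        return this;
    }

    // Returns the value stored under the given field name, or null if absent.
    String get(String name) {
        return fields.get(name);
    }

    public static void main(String[] args) {
        // An email described by two fields: its title and its body.
        ToyDocument email = new ToyDocument()
                .add("title", "Meeting notes")
                .add("contents", "Lucene indexing discussion");
        System.out.println(email.get("title"));
    }
}
```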

Analyzer

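Before a document is indexed, its content must first be tokenized, and this is the job of the Analyzer. Analyzer is an abstract class with several implementations; choose one that suits the language and the application. The Analyzer hands the tokenized content to IndexWriter, which builds the index.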
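
To make "tokenizing" concrete, here is a minimal sketch of what an analyzer does in principle. ToyAnalyzer and tokenize are hypothetical names, and the rule used (lowercase, then split on anything that is not a letter or digit) is a simplifying assumption, much cruder than what a real Lucene Analyzer implementation does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy tokenizer for illustration only; NOT a Lucene Analyzer.
class ToyAnalyzer {
    // Lowercases the text and splits it into tokens on non-alphanumeric runs.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        String lowered = text.toLowerCase(Locale.ROOT);
        for (String piece : lowered.split("[^\\p{L}\\p{N}]+")) {
            // split() can yield an empty leading piece when the text starts with a delimiter.
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, Lucene World!"));
    }
}
```

A rule this simple would not segment Chinese text usefully, which is one reason the choice of Analyzer depends on the language.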

IndexWriter

IndexWriter is one of Lucene's core classes for creating an index. Its job is to add Document objects to the index one by one.

Directory

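This class represents the location where a Lucene index is stored. It is an abstract class that currently has two implementations: FSDirectory, which represents an index stored in the file system, and RAMDirectory, which represents an index stored in memory.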

Now that we are familiar with the classes needed to build an index, we can index the text files in a directory. Listing 1 gives the source code for doing so.

Listing 1. Building an index for text files

package TestLucene;

import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/**
 * Demonstrates how to create a Lucene index
 * for the text files in a directory.
 */
public class TxtFileIndexer {
    public static void main(String[] args) throws Exception {
        // indexDir is the directory that hosts Lucene's index files
        File indexDir = new File("D:\\luceneIndex");
        // dataDir is the directory that hosts the text files to be indexed
        File dataDir = new File("D:\\luceneData");
        Analyzer luceneAnalyzer = new StandardAnalyzer();
        File[] dataFiles = dataDir.listFiles();
        // listFiles() returns null when dataDir is not a readable directory
        if (dataFiles == null) {
            System.out.println("Cannot list files in " + dataDir.getPath());
            return;
        }
        // true: create a new index rather than add to an existing one
        IndexWriter indexWriter = new IndexWriter(indexDir, luceneAnalyzer, true);
        long startTime = new Date().getTime();
        for (int i = 0; i < dataFiles.length; i++) {
            if (dataFiles[i].isFile() && dataFiles[i].getName().endsWith(".txt")) {
                System.out.println(
                        "Indexing file " + dataFiles[i].getCanonicalPath());
                Document document = new Document();
                Reader txtReader = new FileReader(dataFiles[i]);
                // Two fields per document: the file's path and its contents
                document.add(
                        Field.Text("path", dataFiles[i].getCanonicalPath()));
                document.add(Field.Text("contents", txtReader));
                indexWriter.addDocument(document);
            }
        }
        indexWriter.optimize();
        indexWriter.close();
        long endTime = new Date().getTime();

        System.out.println("It takes " + (endTime - startTime)
                + " milliseconds to create index for the files in directory "
                + dataDir.getPath());
    }
}

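In Listing 1, the IndexWriter constructor takes three arguments. The first specifies where the index is to be stored; it can be a File object, or an FSDirectory or RAMDirectory object. The second specifies an implementation of Analyzer, that is, which analyzer tokenizes the document content for this index. The third is a boolean: true means create a new index, and false means operate on an existing one. The program then iterates over the text files in the directory and creates a Document object for each one. It puts two attributes of the file, its path and its contents, into two Field objects, adds those Field objects to the Document, and finally adds the document to the index with IndexWriter's addDocument method. That completes index creation. Next we turn to searching the index we have built.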
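
The article does not describe what the resulting index looks like internally. As a rough mental model only, the sketch below builds a tiny inverted index in plain Java: a map from each token to the paths of the documents that contain it. ToyIndexWriter, its addDocument method, and postings are hypothetical names for illustration; Lucene's real on-disk index format is far more elaborate.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index for illustration only; NOT Lucene's IndexWriter.
class ToyIndexWriter {
    // token -> sorted set of paths whose contents contain that token
    private final Map<String, Set<String>> postings = new TreeMap<>();

    // Tokenizes the contents crudely and records the path under every token.
    void addDocument(String path, String contents) {
        String lowered = contents.toLowerCase(Locale.ROOT);
        for (String token : lowered.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(path);
            }
        }
    }

    Map<String, Set<String>> postings() {
        return postings;
    }

    public static void main(String[] args) {
        ToyIndexWriter writer = new ToyIndexWriter();
        writer.addDocument("a.txt", "Lucene builds an index");
        writer.addDocument("b.txt", "An index speeds up search");
        // Print each token with the documents that contain it.
        writer.postings().forEach((token, paths) ->
                System.out.println(token + " -> " + paths));
    }
}
```

With a structure like this, finding the documents that contain a word is a single map lookup instead of a scan over every file.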


Searching Documents

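Searching with Lucene is just as convenient as building an index. In the previous part we built an index for the text documents in a directory; now we search that index for documents containing a given keyword or phrase. Lucene provides several basic classes for this: IndexSearcher, Term, Query, TermQuery, and Hits. Each is described below.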

Query

This is an abstract class with several implementations, such as TermQuery, BooleanQuery, and PrefixQuery. Its purpose is to wrap the user's query string in a form that Lucene can recognize.

Term

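Term is the basic unit of search. A Term object consists of two String values. One statement is enough to create one: Term term = new Term("fieldName", "queryWord"); The first argument names the Field of the document to search in, and the second is the keyword to search for.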

TermQuery

TermQuery is a subclass of the abstract class Query, and it is the most basic query type Lucene supports. A TermQuery object is created as follows: TermQuery termQuery = new TermQuery(new Term("fieldName", "queryWord")); Its constructor takes a single argument, a Term object.
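
The idea that a term is a (field, text) pair matched exactly can be sketched in plain Java. ToyTermLookup, ToyTerm, index, and search are hypothetical names for illustration and are not Lucene's Term or TermQuery. The sketch assumes tokens were lowercased at indexing time, which is why a differently cased query word finds nothing here.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Toy exact-match lookup for illustration only; NOT Lucene's Term/TermQuery.
class ToyTermLookup {
    // A field name plus the exact text to match in that field.
    static final class ToyTerm {
        final String field;
        final String text;

        ToyTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof ToyTerm)) {
                return false;
            }
            ToyTerm other = (ToyTerm) o;
            return field.equals(other.field) && text.equals(other.text);
        }

        @Override
        public int hashCode() {
            return Objects.hash(field, text);
        }
    }

    // (field, token) -> ids of the documents containing that token in that field
    private final Map<ToyTerm, Set<String>> postings = new HashMap<>();

    void index(String docId, String field, String token) {
        postings.computeIfAbsent(new ToyTerm(field, token), k -> new HashSet<>()).add(docId);
    }

    // Exact match on both field and text; no lowercasing or other analysis is applied.
    Set<String> search(ToyTerm term) {
        return postings.getOrDefault(term, new HashSet<>());
    }

    public static void main(String[] args) {
        ToyTermLookup lookup = new ToyTermLookup();
        lookup.index("a.txt", "contents", "lucene");
        lookup.index("b.txt", "path", "lucene");
        // Only a.txt matches: b.txt has the token in a different field.
        System.out.println(lookup.search(new ToyTerm("contents", "lucene")));
        // No match: the stored token is lowercase and matching is exact.
        System.out.println(lookup.search(new ToyTerm("contents", "Lucene")));
    }
}
```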

IndexSearcher

IndexSearcher searches an index that has already been built. It opens the index read-only, so multiple IndexSearcher instances can operate on the same index.

Hits

Hits holds the results of a search.

With the classes needed for searching introduced, we can now search the index built earlier. Listing 2 gives the code required.

Listing 2. Searching the index

package TestLucene;

import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

/**
 * Demonstrates how to search
 * an existing Lucene index.
 */
public class TxtFileSearcher {
    public static void main(String[] args) throws Exception {
        String queryStr = "lucene";
        // This is the directory that hosts the Lucene index
        File indexDir = new File("D:\\luceneIndex");
        // Check that the index exists before trying to open it
        if (!indexDir.exists()) {
            System.out.println("The Lucene index does not exist");
            return;
        }
        FSDirectory directory = FSDirectory.getDirectory(indexDir, false);
        IndexSearcher searcher = new IndexSearcher(directory);
        // The term is matched as-is, so lowercase it to match the indexed tokens
        Term term = new Term("contents", queryStr.toLowerCase());
        TermQuery luceneQuery = new TermQuery(term);
        Hits hits = searcher.search(luceneQuery);
        for (int i = 0; i < hits.length(); i++) {
            Document document = hits.doc(i);
            System.out.println("File: " + document.get("path"));
        }
        searcher.close();
    }
}

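In Listing 2, the IndexSearcher constructor takes an object of type Directory. Directory is an abstract class that currently has two subclasses: FSDirectory and RAMDirectory. Our program passes in an FSDirectory object, which represents an index stored on disk. Once the constructor returns, the IndexSearcher has opened the index read-only. The program then constructs a Term object specifying that we want documents whose contents contain the keyword "lucene". It builds a TermQuery from this Term and passes it to IndexSearcher's search method; the results are stored in a Hits object. Finally, a loop prints the path of every matching document. That completes our search application. As you can see, building a search application with Lucene is quite simple.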

Posted on 2006-07-14 08:45 by 保爾任. Category: open source
