Sat, 01 Nov 2008 09:57:00 GMT

Lucene is distributed as a single JAR file. Below we walk through the main Java packages inside that JAR to give the reader a first overview.

Package: org.apache.lucene.document

This package provides the classes needed to wrap the documents to be indexed, such as Document and Field. Every document is ultimately wrapped in a Document object.

Package: org.apache.lucene.analysis

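The main job of this package is to tokenize documents. A document must be tokenized before it can be indexed, so this package can be seen as doing the preparatory work for indexing.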

Package: org.apache.lucene.index

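This package provides classes that help create an index and update an existing one. It contains two basic classes, IndexWriter and IndexReader. IndexWriter creates an index and adds documents to it; IndexReader is used to delete documents from the index.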

Package: org.apache.lucene.search

This package provides the classes needed to search an existing index, such as IndexSearcher and Hits. IndexSearcher defines the methods for searching a given index, and Hits holds the results of a search.

Building an index

To index documents, Lucene provides five basic classes: Document, Field, IndexWriter, Analyzer, and Directory. The purpose of each is described below.

Document

A Document describes a document, which here can be an HTML page, an e-mail message, or a text file. A Document object is made up of several Field objects. You can think of a Document as a record in a database and of each Field as one column of that record.

Field

A Field object describes one attribute of a document. For example, the subject and the body of an e-mail message can be described by two separate Field objects.
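The record analogy can be made concrete with a small sketch. The class below is a toy model written only for illustration: `ToyDocument` and its methods are hypothetical names, it is not Lucene's Document class, and as a simplification it keeps a single string value per field name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a document as a record of named fields; not Lucene's Document class. */
final class ToyDocument {
    // Field name -> field value, kept in insertion order.
    private final Map<String, String> fields = new LinkedHashMap<>();

    /** Adds (or overwrites) a field and returns this document for chaining. */
    ToyDocument add(String name, String value) {
        fields.put(name, value);
        return this;
    }

    /** Returns the value stored under the field name, or null if absent. */
    String get(String name) {
        return fields.get(name);
    }

    public static void main(String[] args) {
        // An e-mail message described by two fields, as in the text above.
        ToyDocument email = new ToyDocument()
                .add("title", "Meeting notes")
                .add("body", "We agreed to index the archive.");
        System.out.println(email.get("title"));
    }
}
```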

Analyzer

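Before a document is indexed, its content first has to be tokenized, and that is the job of the Analyzer. Analyzer is an abstract class with several implementations; choose the one that suits the language and the application. The Analyzer hands the tokenized content to the IndexWriter, which builds the index.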
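To give a feel for what tokenizing means, here is a minimal sketch of an analysis step. It is a toy written for this illustration and not how StandardAnalyzer or any other Lucene analyzer is implemented: the class name `ToyAnalyzer`, the splitting rule, and the stop-word list are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy stand-in for an analyzer; not Lucene's StandardAnalyzer. */
final class ToyAnalyzer {
    // Hypothetical stop-word list, chosen only for this illustration.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    /** Splits on runs of non-letter, non-digit characters, lowercases, drops stop words. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            // split() can yield an empty first element when the text starts with a separator.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene is a search library."));
    }
}
```

A rule this simple only suits space-delimited text, which is one reason the choice of analyzer depends on the language being indexed.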

IndexWriter

IndexWriter is the core class Lucene uses to create an index. Its job is to add Document objects to the index one by one.

Directory

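This class represents the location where a Lucene index is stored. It is an abstract class that currently has two implementations. The first is FSDirectory, which represents an index stored in the file system. The second is RAMDirectory, which represents an index stored in memory.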

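Now that we are familiar with the classes needed for indexing, we can index the text files in a directory. Listing 1 gives the source code for indexing the text files under a given directory.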

Listing 1. Building an index over text files

    package TestLucene;

    import java.io.File;
    import java.io.FileReader;
    import java.io.Reader;
    import java.util.Date;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    /**
     * This class demonstrates the process of creating an index with Lucene
     * for text files.
     */
    public class TxtFileIndexer {
        public static void main(String[] args) throws Exception {
            // indexDir is the directory that hosts Lucene's index files
            File indexDir = new File("D:\\luceneIndex");
            // dataDir is the directory that hosts the text files to be indexed
            File dataDir = new File("D:\\luceneData");
            Analyzer luceneAnalyzer = new StandardAnalyzer();
            File[] dataFiles = dataDir.listFiles();
            // true: create a new index instead of opening an existing one
            IndexWriter indexWriter = new IndexWriter(indexDir, luceneAnalyzer, true);
            long startTime = new Date().getTime();
            for (int i = 0; i < dataFiles.length; i++) {
                if (dataFiles[i].isFile() && dataFiles[i].getName().endsWith(".txt")) {
                    System.out.println("Indexing file " + dataFiles[i].getCanonicalPath());
                    Document document = new Document();
                    Reader txtReader = new FileReader(dataFiles[i]);
                    // One field for the file path, one for the file contents
                    document.add(Field.Text("path", dataFiles[i].getCanonicalPath()));
                    document.add(Field.Text("contents", txtReader));
                    indexWriter.addDocument(document);
                }
            }
            indexWriter.optimize();
            indexWriter.close();
            long endTime = new Date().getTime();
            System.out.println("It takes " + (endTime - startTime)
                    + " milliseconds to create index for the files in directory "
                    + dataDir.getPath());
        }
    }

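In Listing 1, note that the IndexWriter constructor takes three arguments. The first specifies where the index is to be stored; it can be a File object, an FSDirectory object, or a RAMDirectory object. The second specifies an implementation of the Analyzer class, that is, which analyzer is used to tokenize the document content for this index. The third is a boolean: true means a new index is created, false means the writer operates on the existing index.

The program then iterates over all the text files in the directory and creates a Document object for each one. It puts two attributes of the file, its path and its contents, into two Field objects, adds those Field objects to the Document, and finally adds the Document to the index with IndexWriter's addDocument method. That completes the creation of the index. Next we turn to searching the index we have built.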
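Conceptually, what the indexing step produces is an inverted index: a mapping from each term to the documents that contain it. The sketch below shows that idea in memory with plain Java collections. It is a toy under stated assumptions, not Lucene's index format or its IndexWriter: `ToyInvertedIndex`, `addDocument`, and `lookup` are hypothetical names, and the tokenization is a naive split on non-alphanumeric characters.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy in-memory inverted index; a conceptual sketch, not Lucene's index format. */
final class ToyInvertedIndex {
    // term -> ids of the documents that contain it
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Tokenizes the text naively, records each term under a new document id, returns the id. */
    int addDocument(String contents) {
        int docId = nextDocId++;
        for (String raw : contents.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // leading separators produce an empty first element
            }
            String term = raw.toLowerCase(Locale.ROOT);
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Returns the ids of documents containing the term (matched case-insensitively). */
    SortedSet<Integer> lookup(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        if (hits == null) {
            return Collections.<Integer>emptySortedSet();
        }
        return Collections.unmodifiableSortedSet(hits);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("Lucene is a search library");
        index.addDocument("Indexing text files with Lucene");
        System.out.println(index.lookup("lucene"));
    }
}
```

Because the lookup goes from term to documents, answering a keyword query does not require rescanning the original files, which is why the files are analyzed and indexed up front.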

Searching documents

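Searching with Lucene is just as convenient as building an index. In the previous part we built an index over the text files in a directory; now we want to search that index for the documents that contain a given keyword or phrase. Lucene provides several basic classes for this: IndexSearcher, Term, Query, TermQuery, and Hits. The role of each is described below.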
Query

Query is an abstract class with several implementations, such as TermQuery, BooleanQuery, and PrefixQuery. Its purpose is to wrap the query string entered by the user in a form that Lucene can understand.
Term

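A Term is the basic unit of search. A Term object consists of two String values and can be created with a single statement: Term term = new Term("fieldName", "queryWord"); The first argument names the Field of the document to search in, and the second is the keyword to search for.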
TermQuery

TermQuery is a subclass of the abstract class Query and the most basic query type Lucene supports. A TermQuery object is created as follows: TermQuery termQuery = new TermQuery(new Term("fieldName", "queryWord")); Its constructor takes a single argument, a Term object.
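The relationship between an abstract query type and its concrete subclasses can be sketched as follows. This is a toy model meant only to illustrate the design, not Lucene's Query classes or how they evaluate against an index: all the `Toy...` names are hypothetical, and a document is represented simply as a map from field name to the set of terms in that field.

```java
import java.util.Map;
import java.util.Set;

/** Toy counterpart of an abstract query type; not Lucene's Query. */
abstract class ToyQuery {
    /** Returns true if a document with these terms (grouped by field name) satisfies the query. */
    abstract boolean matches(Map<String, Set<String>> termsByField);
}

/** Matches documents whose field contains exactly the given word. */
final class ToyTermQuery extends ToyQuery {
    private final String field;
    private final String word;

    ToyTermQuery(String field, String word) {
        this.field = field;
        this.word = word;
    }

    @Override
    boolean matches(Map<String, Set<String>> termsByField) {
        Set<String> terms = termsByField.get(field);
        return terms != null && terms.contains(word);
    }
}

/** Matches documents whose field contains at least one term starting with the prefix. */
final class ToyPrefixQuery extends ToyQuery {
    private final String field;
    private final String prefix;

    ToyPrefixQuery(String field, String prefix) {
        this.field = field;
        this.prefix = prefix;
    }

    @Override
    boolean matches(Map<String, Set<String>> termsByField) {
        Set<String> terms = termsByField.get(field);
        if (terms == null) {
            return false;
        }
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}

/** Combines two queries; both must match. */
final class ToyAndQuery extends ToyQuery {
    private final ToyQuery left;
    private final ToyQuery right;

    ToyAndQuery(ToyQuery left, ToyQuery right) {
        this.left = left;
        this.right = right;
    }

    @Override
    boolean matches(Map<String, Set<String>> termsByField) {
        return left.matches(termsByField) && right.matches(termsByField);
    }
}
```

The point of the abstraction is that search code can accept any query through the common base type, so simple queries and combinations of them are handled the same way.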
IndexSearcher

IndexSearcher is used to search an existing index. It can only open an index in read-only mode, so several IndexSearcher instances can work on the same index at once.
Hits

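Hits holds the results of a search.
Having introduced the classes needed for searching, we can now search the index built earlier. Listing 2 gives the code that performs the search.
Listing 2. Searching an existing index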

    package TestLucene;

    import java.io.File;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.FSDirectory;

    /**
     * This class is used to demonstrate the
     * process of searching an existing
     * Lucene index.
     */
    public class TxtFileSearcher {
        public static void main(String[] args) throws Exception {
            String queryStr = "lucene";
            // This is the directory that hosts the Lucene index
            File indexDir = new File("D:\\luceneIndex");
            // Check that the index exists before trying to open it
            if (!indexDir.exists()) {
                System.out.println("The Lucene index does not exist");
                return;
            }
            FSDirectory directory = FSDirectory.getDirectory(indexDir, false);
            IndexSearcher searcher = new IndexSearcher(directory);
            // Search the "contents" field for the lowercased keyword
            Term term = new Term("contents", queryStr.toLowerCase());
            TermQuery luceneQuery = new TermQuery(term);
            Hits hits = searcher.search(luceneQuery);
            for (int i = 0; i < hits.length(); i++) {
                Document document = hits.doc(i);
                System.out.println("File: " + document.get("path"));
            }
        }
    }

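In Listing 2, the IndexSearcher constructor takes an object of type Directory. Directory is an abstract class that currently has two subclasses, FSDirectory and RAMDirectory. Our program passes in an FSDirectory object, which represents an index stored on disk. Once the constructor returns, the IndexSearcher has opened the index in read-only mode. The program then builds a Term object, which specifies that we want to search the document contents for the keyword "lucene". It uses this Term to build a TermQuery and passes the TermQuery to IndexSearcher's search method; the results come back in a Hits object. Finally, a loop prints the path of every document found. That completes our search application. As you can see, building a search application with Lucene is quite simple.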

Reposted from: http://www-128.ibm.com/developerworks/cn/java/j-lo-lucene1/

Posted by 马光… on 2008-11-01 17:57
