A simple Lucene query example

梓枫 — Fri, 12 Dec 2008 07:48:00 GMT

Summary: A simple Lucene query example.

Posted by 梓枫, 2008-12-12 15:48

An introduction to the main classes of the Lucene search engine API

梓枫 — Tue, 07 Oct 2008 03:16:00 GMT

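The Lucene search API revolves around four main classes: IndexSearcher, Query (including its subclasses), QueryParser, and Hits.
I. IndexSearcher is the entry point for searching; its search method provides the search functionality.
Query has many subclasses, each representing a different kind of query condition. They are described in detail below.
QueryParser is a very general-purpose helper class. Its job is to turn the text a user types into Lucene's built-in Query objects (most web search engines offer a single input box for entering query conditions). QueryParser ships with a rich syntax that lets users express all kinds of advanced queries. For example, "Hello AND world" is parsed into a BooleanQuery with an AND relationship that contains two TermQuery objects (hello and world). This syntax is powerful, but it was designed with English in mind. For Chinese search we do not need to know many Query types, and a few simple ones are usually enough. QueryParser is used as follows:
QueryParser.parse(String query, String field, Analyzer analyzer) throws ParseException
Here, query is the text entered by the user; field is the default field to search (other fields must be named explicitly); and analyzer is used to analyze (tokenize) the user's input. In general this should be the same analyzer that was used when the index was built.
Alternatively, we can construct a QueryParser ourselves with new QueryParser(String field, Analyzer a) (the parameters have the same meaning as above). The advantage is that we can then adjust some of its parameters.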
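Handling search results: the Hits object
The Hits object is the collection of search results. Its main methods are:
length(): returns the number of results (the results themselves are loaded lazily)
doc(n): returns the n-th record
id(n): returns the Document ID of the n-th record
score(n): returns the relevance (score) of the n-th record
Because a result set is usually large, for performance reasons the Hits object does not actually fetch all results at once. By default it keeps the first 100 records (for a typical search engine, 100 records are enough).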
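Handling pagination
100 records are still too many to show at once. We will usually display a smaller number of records per page and split the results over several pages. There are generally two ways to paginate:
Keep the IndexReader object and the Hits object in the session, and pull out the content when the user turns the page
Do not use the session; simply re-run the query every time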
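As a rough illustration of the second approach, the sketch below re-runs the query for every page and slices out the requested window. It is a minimal sketch, not Lucene code: `RequeryPager`, `SearchFn` and `page` are hypothetical names, and the search call is a stand-in that simply returns a ranked list of document names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: stateless paging that re-runs the query for each page. */
final class RequeryPager {

    /** Stand-in for a real search call; returns all ranked hits for a query. */
    interface SearchFn {
        List<String> search(String query);
    }

    /** Returns the hits of page pageIndex (0-based) with pageSize hits per page. */
    static List<String> page(SearchFn searchFn, String query, int pageIndex, int pageSize) {
        if (pageIndex < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageIndex must be >= 0 and pageSize > 0");
        }
        // Re-run the query on every request: nothing is kept in a session.
        List<String> hits = searchFn.search(query);
        int from = (int) Math.min((long) pageIndex * pageSize, hits.size());
        int to = (int) Math.min((long) from + pageSize, hits.size());
        return new ArrayList<>(hits.subList(from, to));
    }

    public static void main(String[] args) {
        // Assumption: a fake search function that always returns five documents.
        SearchFn fakeSearch = query -> Arrays.asList("doc1", "doc2", "doc3", "doc4", "doc5");
        System.out.println(page(fakeSearch, "lucene", 1, 2));
    }
}
```

Requests past the last page simply return an empty list, so the caller needs no extra state to detect the end of the results.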
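Lucene recommends trying the second approach first, that is, re-running the query every time. It is simple and convenient, there is no session handling to worry about, and Lucene's query performance is good enough that each query does not take long. Unless you actually run into a performance problem, there is no need to consider the first approach.
Caching: using RAMDirectory
The RAMDirectory object is very handy. With it we can load an ordinary index entirely into memory. It is used like this:
RAMDirectory ramDir = new RAMDirectory(dir);
Such a RAMDirectory is naturally much faster than the real file system.
Lucene's scoring algorithm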
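By default, the records returned by a Lucene query are sorted by relevance, and this relevance is the score. The scoring algorithm is fairly complex and is of little direct use to those of us building applications. (First, a word about Term: as I understand it, a Term is a single independent query word. After the user's input has been tokenized, case-normalized, stripped of stopwords and so on, it is handled with the Term as the basic unit.) It is enough to keep a few key parameters in mind:
How often the Term occurs in the document
How many documents contain the same Term
The boosting parameter of the field
The length of the term
The number of terms in the document
In general we cannot tune these parameters. If you want to know more, IndexSearcher also provides an explain method: pass in a Query and a document ID and you get back an Explanation object, a simple wrapper around the internal scoring details. Calling toString() on it shows a detailed description.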
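To make the role of these parameters more concrete, here is a small sketch of a TF-IDF style score. It is loosely modeled on the classic square-root term frequency, logarithmic inverse document frequency and length normalization, and it is not Lucene's exact scoring formula, which contains further factors. The class and method names are hypothetical.

```java
/** Hypothetical TF-IDF style scorer; a simplified model, not Lucene's real formula. */
final class TfIdfSketch {

    /**
     * @param termFreq    how often the term occurs in the document field
     * @param docFreq     how many documents contain the term
     * @param numDocs     total number of documents in the index
     * @param fieldLength number of terms in the document field
     * @param boost       boost factor of the field (1.0 means no boost)
     */
    static double score(int termFreq, int docFreq, int numDocs, int fieldLength, double boost) {
        if (numDocs <= 0 || fieldLength <= 0 || termFreq < 0 || docFreq < 0) {
            throw new IllegalArgumentException("counts must be non-negative, sizes positive");
        }
        double tf = Math.sqrt(termFreq);                                // more occurrences, higher score
        double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1));  // rarer terms weigh more
        double lengthNorm = 1.0 / Math.sqrt(fieldLength);               // shorter fields weigh more
        return tf * idf * boost * lengthNorm;
    }

    public static void main(String[] args) {
        System.out.println(score(3, 10, 1000, 50, 1.0));
    }
}
```

The sketch shows the direction of each effect: a term that is frequent in the document, rare in the collection, in a short or boosted field, scores higher.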
II. Creating a Query: an overview of the Query types
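The most basic query: TermQuery
TermQuery is the most basic query. It can be built with Term t = new Term("contents", "cap"); new TermQuery(t).
TermQuery treats the query condition as a key that must match the indexed content exactly. For example, fields of type Field.Keyword can be searched with a TermQuery.
RangeQuery
RangeQuery represents a range condition: RangeQuery query = new RangeQuery(begin, end, included);
The final boolean says whether the boundary values themselves are included. The string forms are "[begin TO end]" (inclusive) and "{begin TO end}" (exclusive).
PrefixQuery
As the name suggests, this matches terms that start with a given prefix. The string form is "something*".
BooleanQuery
This is a composite Query: you can add all kinds of Query objects to it and state their logical relationship. Conditions are added with the method
public void add(Query query, boolean required, boolean prohibited)
The two boolean arguments express the three relationships AND, OR and NOT. The string forms are "AND OR NOT" or "+ -". A BooleanQuery can hold many clauses, but if their number exceeds the value set with setMaxClauseCount(int) (1024 by default), a TooManyClauses error is thrown.
PhraseQuery
PhraseQuery expresses a loosely matched phrase. For example, "red pig" should also match "red fat pig", "red fat big pig" and so on. For this, PhraseQuery provides the setSlop() parameter. During a search Lucene tries to adjust the distance and position of the words, and slop is the maximum number of such moves that is acceptable: if the actual content can be turned into an exact match within that many steps, it counts as a match. The default slop is 0, so by default only exact phrases match. By setting slop (for "red pig" to match "red fat pig", a slop of 1 is needed to move pig one position to the right), we can make Lucene match phrases approximately. Note that PhraseQuery does not guarantee word order: in the example above, "pig red" needs a slop of 2, so with a slop of 2 or more "pig red" is also considered a match.
WildcardQuery
Uses ? to stand for exactly one character and * for zero or more characters. For example, wil* matches wild, wila, wilxaaaa and so on. Note that with a wildcard query every matching record gets the same relevance: wilxaaaa and wild score the same for wil*.
FuzzyQuery
This Query is of little use for Chinese. It matches English words approximately (everything above deals with exact terms or phrases); for example, fuzzy and wuzzy can be treated as similar. FuzzyQuery is reasonably useful for English tense changes and plural forms. The matches do not all get the same relevance. The string form is "fuzzy~".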
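III. Using QueryParser
With a search engine, users often have just one input box in which to enter all their query conditions (as with Google). This is where QueryParser comes in: it converts all kinds of user input into a Query or a group of Query objects. It turns the string forms of the queries described above (as produced by Query.toString) into actual Query objects; for example, "wuzzy~" becomes a FuzzyQuery. However, QueryParser runs the input through an Analyzer, so calling toString on a parsed Query does not necessarily give back the original text. The additional syntax supported by QueryParser includes:
Grouping
For example "(a AND b) OR c". This is grouping with parentheses and is easy to understand.
Field selection
QueryParser applies conditions to the default field, which is specified in code when the QueryParser parses the query. If the user wants to search another field within the query, the syntax fieldname:fielda can be used; for a group of several terms, write fieldname:(fielda fieldb fieldc).
The * problem
By default QueryParser does not allow * at the start of a term. The main purpose is to prevent users from accidentally entering a leading *, which would cause a serious performance problem (every record would be read).
Boosting
Writing hello^2.0 boosts the term hello (I cannot imagine many users actually doing this).
QueryParser is a ready-made helper class that works out of the box, but it still offers many parameters for programmers to tune. To use them, we first construct a new QueryParser ourselves and then customize its parameters.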

Posted by 梓枫, 2008-10-07 11:16

梓枫 — Mon, 22 Sep 2008 02:17:00 GMT

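This article uses test cases to introduce the various query types in Lucene and the simpler ways of writing them.
After reading it you will know Lucene's basic query types, and you can study all the test code to deepen your understanding.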

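The query types in detail

Now that you know SQL, would you like to learn about query syntax trees? Here is a brief introduction to some query types that Lucene can use directly.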

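1. TermQuery
Searches for one specific term. It was already introduced in the example at the beginning of the article and is commonly used to look up keywords.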

         [Test]
         public void Keyword()
         {
              IndexSearcher searcher = new IndexSearcher(directory);
              Term t = new Term("isbn", "1930110995");
              Query query = new TermQuery(t);
              Hits hits = searcher.Search(query);
              Assert.AreEqual(1, hits.Length(), "JUnit in Action");
         }

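Note that Lucene does not enforce uniqueness of keywords; the application has to guarantee it.

TermQuery and QueryParser

Whenever the text passed to QueryParser's Parse method contains just a single word, it is automatically converted to a TermQuery.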

2. RangeQuery
Searches within a range, typically a range of dates. Again, let us look at an example:

namespace dotLucene.inAction.BasicSearch
{
     public class RangeQueryTest : LiaTestCase
     {
         private Term begin, end;

         [SetUp]
         protected override void Init()
         {
              begin = new Term("pubmonth", "200004");

              end = new Term("pubmonth", "200206");
              base.Init();
         }

         [Test]
         public void Inclusive()
         {
              RangeQuery query = new RangeQuery(begin, end, true);
              IndexSearcher searcher = new IndexSearcher(directory);

              Hits hits = searcher.Search(query);
              Assert.AreEqual(1, hits.Length());
         }

         [Test]
         public void Exclusive()
         {
              RangeQuery query = new RangeQuery(begin, end, false);
              IndexSearcher searcher = new IndexSearcher(directory);

              Hits hits = searcher.Search(query);
              Assert.AreEqual(0, hits.Length());
         }

     }
}

The third argument of RangeQuery says whether the start and end dates themselves are included.
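The effect of the inclusive flag can be sketched with plain string comparison. This is a minimal illustration under the assumption that the range bounds are compared as strings in lexicographic order, which is why dates are stored in a fixed-width form such as "200004". `RangeSketch` and `inRange` are hypothetical names, not part of Lucene.

```java
/** Hypothetical sketch of an inclusive/exclusive range check on string terms. */
final class RangeSketch {

    /** Returns true if value lies between begin and end, compared as strings. */
    static boolean inRange(String value, String begin, String end, boolean inclusive) {
        int lower = value.compareTo(begin); // > 0 means value sorts after begin
        int upper = value.compareTo(end);   // < 0 means value sorts before end
        return inclusive
                ? (lower >= 0 && upper <= 0)
                : (lower > 0 && upper < 0);
    }

    public static void main(String[] args) {
        // A value equal to the upper bound only matches the inclusive range.
        System.out.println(inRange("200206", "200004", "200206", true));
        System.out.println(inRange("200206", "200004", "200206", false));
    }
}
```

Because the comparison is textual, "9" sorts after "10"; numeric or date values need a fixed width (zero padding) for such ranges to behave as expected.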

RangeQuery and QueryParser

         [Test]
         public void TestQueryParser()
         {
              Query query = QueryParser.Parse("pubmonth:[200004 TO 200206]", "subject", new SimpleAnalyzer());
              Assert.IsTrue(query is RangeQuery);
              IndexSearcher searcher = new IndexSearcher(directory);
              Hits hits = searcher.Search(query);

              query = QueryParser.Parse("{200004 TO 200206}", "pubmonth", new SimpleAnalyzer());
              hits = searcher.Search(query);
              Assert.AreEqual(0, hits.Length(), "JDwA in 200206");
         }

Lucene uses [] for inclusive ranges and {} for exclusive ranges.

3. PrefixQuery

Searches for terms that begin with a given prefix. It is often used for searching catalog (category) paths.

         [Test]
         public void TestPrefixQuery()
         {
              PrefixQuery query = new PrefixQuery(new Term("category", "/Computers"));

             IndexSearcher searcher = new IndexSearcher(directory);
              Hits hits = searcher.Search(query);
              Assert.AreEqual(2, hits.Length());

              query = new PrefixQuery(new Term("category", "/Computers/JUnit"));
              hits = searcher.Search(query);
              Assert.AreEqual(1, hits.Length(), "JUnit in Action");
         }

PrefixQuery and QueryParser

         [Test]
         public void TestQueryParser()
         {

              QueryParser qp = new QueryParser("category", new SimpleAnalyzer());
              qp.SetLowercaseWildcardTerms(false);
              Query query =qp.Parse("/Computers*");
              Console.Out.WriteLine("query = {0}", query.ToString());
              IndexSearcher searcher = new IndexSearcher(directory);
              Hits hits = searcher.Search(query);
              Assert.AreEqual(2, hits.Length());
              query =qp.Parse("/Computers/JUnit*");
              hits = searcher.Search(query);
              Assert.AreEqual(1, hits.Length(), "JUnit in Action");
         }

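Note that here we use a QueryParser object (an instance) rather than the static methods of the QueryParser class. The reason is that an instance lets us change some of QueryParser's default settings. In the example above our category values contain uppercase letters, but by default QueryParser lowercases every query string that contains *, giving /computers*. That would no longer find /Computers* as stored in the index, so we have to change this default through the QueryParser instance, which is what qp.SetLowercaseWildcardTerms(false) does.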

4. BooleanQuery

Used to test that several conditions are satisfied together.

The two examples below test the AND case and the OR case respectively.

         [Test]
         public void And()
         {
              TermQuery searchingBooks =
                   new TermQuery(new Term("subject", "junit"));

              RangeQuery currentBooks =
                   new RangeQuery(new Term("pubmonth", "200301"),
                                  new Term("pubmonth", "200312"),
                                  true);
              BooleanQuery currentSearchingBooks = new BooleanQuery();
              currentSearchingBooks.Add(searchingBooks, true, false);
              currentSearchingBooks.Add(currentBooks, true, false);
              IndexSearcher searcher = new IndexSearcher(directory);
              Hits hits = searcher.Search(currentSearchingBooks);

              AssertHitsIncludeTitle(hits, "JUnit in Action");
         }
         [Test]
         public void Or()
         {
              TermQuery methodologyBooks = new TermQuery(
                   new Term("category",
                            "/Computers/JUnit"));
              TermQuery easternPhilosophyBooks = new TermQuery(
                   new Term("category",
                            "/Computers/Ant"));
              BooleanQuery enlightenmentBooks = new BooleanQuery();
              enlightenmentBooks.Add(methodologyBooks, false, false);
              enlightenmentBooks.Add(easternPhilosophyBooks, false, false);
              IndexSearcher searcher = new IndexSearcher(directory);
              Hits hits = searcher.Search(enlightenmentBooks);
              Console.Out.WriteLine("or = " + enlightenmentBooks);
              AssertHitsIncludeTitle(hits, "Java Development with Ant");
              AssertHitsIncludeTitle(hits, "JUnit in Action");

         }

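When is it AND and when is it OR? The answer lies in the arguments of the Add method of the BooleanQuery object.

The first argument is the query clause to add.

The second argument, Required, says whether this clause must be satisfied: true means it must be, false means it need not be.

The third argument, Prohibited, says whether this clause must be rejected: true means results that satisfy the clause are excluded, false means results may satisfy it.

This gives three valid combinations:

Required   Prohibited   Meaning
true       false        the clause must match (AND)
false      false        the clause may match (OR)
false      true         the clause must not match (NOT)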
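The way the Required and Prohibited flags combine can be sketched over plain sets of document IDs. This is a simplified model for illustration only: `BooleanClauseSketch`, `Clause` and `evaluate` are hypothetical names, each clause is represented by the set of documents it matches, and scoring is ignored.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical model of required / optional / prohibited clauses over doc-id sets. */
final class BooleanClauseSketch {

    static final class Clause {
        final Set<Integer> matchingDocs;
        final boolean required;
        final boolean prohibited;

        Clause(Set<Integer> matchingDocs, boolean required, boolean prohibited) {
            if (required && prohibited) {
                throw new IllegalArgumentException("a clause cannot be both required and prohibited");
            }
            this.matchingDocs = matchingDocs;
            this.required = required;
            this.prohibited = prohibited;
        }
    }

    /** Returns the documents from allDocs that satisfy the combination of clauses. */
    static Set<Integer> evaluate(Set<Integer> allDocs, List<Clause> clauses) {
        boolean hasRequired = false;
        for (Clause c : clauses) {
            hasRequired |= c.required;
        }
        Set<Integer> result = new TreeSet<Integer>();
        for (Integer doc : allDocs) {
            boolean rejected = false;
            boolean optionalHit = false;
            for (Clause c : clauses) {
                boolean hit = c.matchingDocs.contains(doc);
                if (c.prohibited) {
                    rejected |= hit;       // NOT: a hit excludes the document
                } else if (c.required) {
                    rejected |= !hit;      // AND: a miss excludes the document
                } else {
                    optionalHit |= hit;    // OR: at least one optional clause should hit
                }
            }
            // Without any required clause, at least one optional clause must match.
            if (!rejected && (hasRequired || optionalHit)) {
                result.add(doc);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> all = new TreeSet<Integer>(Arrays.asList(1, 2, 3, 4));
        Set<Integer> a = new TreeSet<Integer>(Arrays.asList(1, 2));
        Set<Integer> b = new TreeSet<Integer>(Arrays.asList(2, 3));
        System.out.println(evaluate(all, Arrays.asList(new Clause(a, true, false), new Clause(b, false, true))));
    }
}
```

In this model a query made only of prohibited clauses matches nothing, since there is no clause that selects documents in the first place.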

BooleanQuery and QueryParser

         [Test]
         public void TestQueryParser()
         {
              Query query = QueryParser.Parse("pubmonth:[200301 TO 200312] AND junit", "subject", new SimpleAnalyzer());
              IndexSearcher searcher = new IndexSearcher(directory);
              Hits hits = searcher.Search(query);
              Assert.AreEqual(1, hits.Length());
              query = QueryParser.Parse("/Computers/JUnit OR /Computers/Ant", "category", new WhitespaceAnalyzer());
              hits = searcher.Search(query);
              Assert.AreEqual(2, hits.Length());
         }

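Note that AND and OR must be written in uppercase. To express "A and not B", write A AND -B; +A -B also works.

By default QueryParser treats a space between terms as OR, but you can change this setting through a QueryParser object.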

         [Test]
         public void TestQueryParserDefaultAND()
         {
              QueryParser qp = new QueryParser("subject", new SimpleAnalyzer());
              qp.SetOperator(QueryParser.DEFAULT_OPERATOR_AND );
              Query query = qp.Parse("pubmonth:[200301 TO 200312] junit");
              IndexSearcher searcher = new IndexSearcher(directory);
              Hits hits = searcher.Search(query);
              Assert.AreEqual(1, hits.Length());

         }
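5. PhraseQuery
Searches for a phrase. The key concept here is slop, the positional offset allowed between the words, and its value affects the score of the results. A match with a slop of 0 is of course the closest match. The examples below should make this clear. Users do not need to understand exactly how slop is computed, but a large slop does hurt query performance, so in practice keep the value small. PhraseQuery ignores the order of the words in the phrase. This raises the hit rate but also has a significant performance cost. SpanNearQuery can be used to control word order and improve performance.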
     [SetUp]
     protected void Init()
     {
         // set up sample document
         RAMDirectory directory = new RAMDirectory();
         IndexWriter writer = new IndexWriter(directory,
                                              new WhitespaceAnalyzer(), true);
         Document doc = new Document();
         doc.Add(Field.Text("field",
                            "the quick brown fox jumped over the lazy dog"));
         writer.AddDocument(doc);
         writer.Close();

         searcher = new IndexSearcher(directory);
     }
     private bool matched(String[] phrase, int slop)
     {
         PhraseQuery query = new PhraseQuery();
         query.SetSlop(slop);

         for (int i = 0; i < phrase.Length; i++)
         {
              query.Add(new Term("field", phrase[i]));
         }

         Hits hits = searcher.Search(query);
         return hits.Length() > 0;
     }

     [Test]
     public void SlopComparison()
     {
         String[] phrase = new String[]{"quick", "fox"};

         Assert.IsFalse(matched(phrase, 0), "exact phrase not found");

         Assert.IsTrue(matched(phrase, 1), "close enough");
     }

     [Test]
     public void Reverse()
     {
         String[] phrase = new String[] {"fox", "quick"};

         Assert.IsFalse(matched(phrase, 2), "exact phrase not found");

         Assert.IsTrue(matched(phrase, 3), "close enough");
     }

     [Test]
     public void Multiple()
     {
         Assert.IsFalse(matched(new String[] {"quick", "jumped", "lazy"}, 3), "not close enough");
         Assert.IsTrue(matched(new String[] {"quick", "jumped", "lazy"}, 4), "just enough");
         Assert.IsFalse(matched(new String[] {"lazy", "jumped", "quick"}, 7), "almost but not quite");
         Assert.IsTrue(matched(new String[] {"lazy", "jumped", "quick"}, 8), "bingo");
     }
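The slop thresholds asserted in these tests can be reproduced with a small model: shift each term's position in the document by its index in the phrase, and take the spread (maximum minus minimum) of the shifted positions. The sketch below is a simplified model for illustration and not Lucene's actual phrase matcher; `SlopSketch` and `requiredSlop` are hypothetical names, and repeated terms in the phrase are not treated specially.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the smallest slop a phrase needs to match a document. */
final class SlopSketch {

    /** Returns the smallest slop under which phrase matches, or -1 if a term is missing. */
    static int requiredSlop(String[] docTokens, String[] phrase) {
        if (phrase.length == 0) {
            throw new IllegalArgumentException("phrase must contain at least one term");
        }
        // Collect every position at which each phrase term occurs in the document.
        List<List<Integer>> positions = new ArrayList<List<Integer>>();
        for (String term : phrase) {
            List<Integer> found = new ArrayList<Integer>();
            for (int i = 0; i < docTokens.length; i++) {
                if (docTokens[i].equals(term)) {
                    found.add(i);
                }
            }
            if (found.isEmpty()) {
                return -1;
            }
            positions.add(found);
        }
        return smallestSpread(positions, 0, Integer.MAX_VALUE, Integer.MIN_VALUE);
    }

    // Tries every combination of occurrences. For one combination the cost is
    // max(position - phraseIndex) - min(position - phraseIndex).
    private static int smallestSpread(List<List<Integer>> positions, int index, int min, int max) {
        if (index == positions.size()) {
            return max - min;
        }
        int smallest = Integer.MAX_VALUE;
        for (int position : positions.get(index)) {
            int shifted = position - index;
            int spread = smallestSpread(positions, index + 1,
                    Math.min(min, shifted), Math.max(max, shifted));
            smallest = Math.min(smallest, spread);
        }
        return smallest;
    }

    public static void main(String[] args) {
        String[] doc = "the quick brown fox jumped over the lazy dog".split(" ");
        System.out.println(requiredSlop(doc, new String[] {"quick", "fox"}));
        System.out.println(requiredSlop(doc, new String[] {"fox", "quick"}));
    }
}
```

For the sample sentence this model gives 1 for "quick fox", 3 for "fox quick", 4 for "quick jumped lazy" and 8 for "lazy jumped quick", which agrees with the boundaries the tests check.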

PhraseQuery and QueryParser

When running a phrase query through QueryParser, the slop has to be set. There are two ways to do it, as shown below:
     [Test]
     public void TestQueryParser()
     {
         Query q1 = QueryParser.Parse("\"quick fox\"",
              "field", new SimpleAnalyzer());
         Hits hits1 = searcher.Search(q1);
         Assert.AreEqual(hits1.Length(), 0);

         Query q2 = QueryParser.Parse("\"quick fox\"~1",          // first way: slop in the query string
                                     "field", new SimpleAnalyzer());
         Hits hits2 = searcher.Search(q2);
         Assert.AreEqual(hits2.Length(), 1);

         QueryParser qp = new QueryParser("field", new SimpleAnalyzer());
         qp.SetPhraseSlop(1);                                    // second way: default slop on the parser
         Query q3 = qp.Parse("\"quick fox\"");
         Assert.AreEqual("\"quick fox\"~1", q3.ToString("field"), "sloppy, implicitly");
         Hits hits3 = searcher.Search(q3);
         Assert.AreEqual(hits3.Length(), 1);
     }

6. WildcardQuery
Wildcard search. Note that all matches, for example mild and mildew, receive the same score.
         [Test]
         public void Wildcard()
         {
              IndexSingleFieldDocs(new Field[]
                   {
                       Field.Text("contents", "wild"),
                       Field.Text("contents", "child"),
                       Field.Text("contents", "mild"),
                       Field.Text("contents", "mildew")
                   });
              IndexSearcher searcher = new IndexSearcher(directory);
              Query query = new WildcardQuery(
                   new Term("contents", "?ild*"));
              Hits hits = searcher.Search(query);
              Assert.AreEqual(3, hits.Length(), "child no match");
              Assert.AreEqual(hits.Score(0), hits.Score(1), 0.0, "score the same");
              Assert.AreEqual(hits.Score(1), hits.Score(2), 0.0, "score the same");
         }
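Why "?ild*" matches wild, mild and mildew but not child can be checked with a tiny matcher in which ? consumes exactly one character and * consumes zero or more. This is a minimal sketch of the pattern semantics only, not Lucene's term enumeration; `WildcardSketch` and `matches` are hypothetical names.

```java
/** Hypothetical matcher for patterns using ? (one character) and * (zero or more). */
final class WildcardSketch {

    static boolean matches(String pattern, String text) {
        return matches(pattern, 0, text, 0);
    }

    private static boolean matches(String pattern, int p, String text, int t) {
        if (p == pattern.length()) {
            return t == text.length(); // pattern used up: text must be used up too
        }
        char c = pattern.charAt(p);
        if (c == '*') {
            // '*' may consume zero or more characters of the text.
            for (int next = t; next <= text.length(); next++) {
                if (matches(pattern, p + 1, text, next)) {
                    return true;
                }
            }
            return false;
        }
        if (t == text.length()) {
            return false; // '?' or a literal needs one more character
        }
        if (c == '?' || c == text.charAt(t)) {
            return matches(pattern, p + 1, text, t + 1);
        }
        return false;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"wild", "child", "mild", "mildew"}) {
            System.out.println(word + " -> " + matches("?ild*", word));
        }
    }
}
```

The word child fails because ? can only stand for its first letter, after which the pattern requires "ild" but the text continues with "hild".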
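WildcardQuery and QueryParser
Note that, for performance reasons, QueryParser does not allow a wildcard at the very beginning of a term.
Also for performance reasons, a query term whose only wildcard is a trailing * is converted to a PrefixQuery.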
         [Test, ExpectedException(typeof (ParseException))]
         public void TestQueryParserException()
         {
              Query query = QueryParser.Parse("?ild*", "contents", new WhitespaceAnalyzer());
         }

         [Test]
         public void TestQueryParserTailAsterrisk()
         {
              Query query = QueryParser.Parse("mild*", "contents", new WhitespaceAnalyzer());
              Assert.IsTrue(query is PrefixQuery);
              Assert.IsFalse(query is WildcardQuery);

         }

         [Test]
         public void TestQueryParser()
         {
              Query query = QueryParser.Parse("mi?d*", "contents", new WhitespaceAnalyzer());
              Hits hits = searcher.Search(query);
              Assert.AreEqual(2, hits.Length());
         }
7. FuzzyQuery
Fuzzy search. Note that the two matches receive different scores, unlike with WildcardQuery.

         [Test]
         public void Fuzzy()
         {
              Query query = new FuzzyQuery(new Term("contents", "wuzza"));
              Hits hits = searcher.Search(query);
              Assert.AreEqual( 2, hits.Length(),"both close enough");
              Assert.IsTrue(hits.Score(0) != hits.Score(1),"wuzzy closer than fuzzy");
              Assert.AreEqual("wuzzy", hits.Doc(0).Get("contents"),"wuzza bear");
         }
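The ranking in this test (wuzzy closer to wuzza than fuzzy) follows from edit distance, on which fuzzy matching of this kind is based. The sketch below computes the Levenshtein distance between two words. It illustrates the distance measure only and is not Lucene's scoring code; `EditDistanceSketch` and `distance` are hypothetical names.

```java
/** Hypothetical Levenshtein edit distance between two words. */
final class EditDistanceSketch {

    /** Returns the minimum number of single-character insertions, deletions or substitutions. */
    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + substitution);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("wuzza", "wuzzy"));
        System.out.println(distance("wuzza", "fuzzy"));
    }
}
```

The query wuzza is one substitution away from wuzzy and two away from fuzzy, so wuzzy is the closer match.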

FuzzyQuery and QueryParser

Note the difference from the slop notation of PhraseQuery: there, the ~ must be followed by a number.

         [Test]
         public void TestQueryParser()
         {
              Query query =QueryParser.Parse("wuzza~","contents",new SimpleAnalyzer());
              Hits hits = searcher.Search(query);
              Assert.AreEqual( 2, hits.Length(),"both close enough");
         }

Posted by 梓枫, 2008-09-22 10:17

Part 1: Getting to know Lucene

梓枫 — Wed, 17 Sep 2008 04:10:00 GMT

This article first introduces some basic concepts of Lucene and then develops an application that demonstrates how to build an index with Lucene and search it.
Introduction to Lucene

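Lucene is a Java-based full-text information retrieval toolkit. It is not a complete search application; rather, it provides indexing and search capabilities for your application. Lucene is currently an open-source project in the Apache Jakarta family, and it is also the most popular Java-based open-source full-text retrieval toolkit at present.
Many applications already base their search on Lucene, for example the search function of the Eclipse help system. Lucene builds indexes over textual data, so as long as you can convert the data you want to index into text, Lucene can index and search your documents. For example, to index HTML or PDF documents you first convert them to plain text, hand the converted content to Lucene for indexing, store the resulting index files on disk or in memory, and finally run queries against the index according to the conditions the user enters. Because Lucene does not prescribe the format of the documents to be indexed, it fits almost any search application.
Figure 1 shows the relationship between a search application and Lucene, and also reflects the workflow of building a search application with Lucene:

Figure 1. The relationship between a search application and Lucene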

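Indexing and searching
Indexing is the core of a modern search engine. Building an index means processing the source data into index files that are very convenient to query. Why is the index so important? Imagine searching a large number of documents for those that contain a given keyword. Without an index you would have to read the documents into memory one after another and check each for the keyword, which takes a great deal of time, whereas search engines return results within milliseconds. That speed comes from the index. You can think of an index as a data structure that gives you fast random access to the keywords stored in it and, through them, to the documents associated with each keyword. Lucene uses a mechanism called an inverted index. With an inverted index we maintain a table of words/phrases, and for each word/phrase in the table a linked list records which documents contain it. When the user enters a query, the search results can therefore be found very quickly. The second part of this series will describe Lucene's indexing mechanism in detail. Because Lucene offers a simple, easy-to-use API, even readers who do not yet know much about full-text indexing can easily use Lucene to index their documents.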
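The idea of an inverted index can be sketched in a few lines: a map from each word to the set of documents that contain it. This is a toy illustration under simplifying assumptions (lowercasing and splitting on non-word characters instead of a real analyzer), and it says nothing about Lucene's actual index format; `InvertedIndexSketch` and its methods are hypothetical names.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical toy inverted index: word -> ids of the documents containing it. */
final class InvertedIndexSketch {

    private final Map<String, SortedSet<Integer>> postings =
            new HashMap<String, SortedSet<Integer>>();

    /** Tokenizes the text naively and records docId under every word found. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            SortedSet<Integer> docs = postings.get(token);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                postings.put(token, docs);
            }
            docs.add(docId);
        }
    }

    /** Looks the word up directly instead of scanning every document. */
    SortedSet<Integer> search(String word) {
        SortedSet<Integer> docs = postings.get(word.toLowerCase(Locale.ROOT));
        if (docs == null) {
            return new TreeSet<Integer>();
        }
        return Collections.unmodifiableSortedSet(docs);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene is a search library");
        index.add(2, "Searching with Lucene is fast");
        index.add(3, "An unrelated note");
        System.out.println(index.search("lucene"));
    }
}
```

A lookup costs one map access regardless of how many documents were indexed, which is the property the paragraph above describes.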
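Once the documents have been indexed, you can search the index. The search engine first parses the search keywords, then looks them up in the index, and finally returns the documents associated with the keywords the user entered.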

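An analysis of the Lucene packages
Lucene is distributed as a JAR file. Below we look at the main Java packages inside this JAR file to give readers a first overview.
Package: org.apache.lucene.document
This package provides the classes needed to wrap the documents to be indexed, such as Document and Field. Every document is ultimately wrapped in a Document object.
Package: org.apache.lucene.analysis
The main job of this package is to tokenize documents. Because a document must be tokenized before it can be indexed, this package can be seen as doing the preparatory work for indexing.
Package: org.apache.lucene.index
This package provides classes that help create an index and update an existing one. It contains two basic classes, IndexWriter and IndexReader: IndexWriter creates an index and adds documents to it, and IndexReader deletes documents from an index.
Package: org.apache.lucene.search
This package provides the classes needed to search an existing index, such as IndexSearcher and Hits. IndexSearcher defines the methods for searching a given index, and Hits holds the results of a search.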

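A simple search application
Suppose a directory on our computer contains many text documents and we need to find the ones that contain a certain keyword. To do this, we first use Lucene to index the documents in the directory and then search the index for the documents we want. This example should give readers a fairly clear picture of how to build their own search application with Lucene.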

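Building the index
To index documents, Lucene provides five basic classes: Document, Field, IndexWriter, Analyzer and Directory. Their purposes are as follows:
Document
A Document describes a document, which may be an HTML page, an e-mail or a text file. A Document object is made up of several Field objects. You can think of a Document as a record in a database and of each Field as one column of that record.
Field
A Field object describes one attribute of a document; for example, the subject and body of an e-mail can be described by two separate Field objects.
Analyzer
Before a document is indexed, its content must first be tokenized, and this is the job of the Analyzer. Analyzer is an abstract class with several implementations; choose the one that suits your language and application. The Analyzer passes the tokenized content to the IndexWriter, which builds the index.
IndexWriter
IndexWriter is a core class that Lucene uses to create an index. Its job is to add Document objects to the index one by one.
Directory
This class represents the location where a Lucene index is stored. It is an abstract class that currently has two implementations: FSDirectory, which represents an index stored in the file system, and RAMDirectory, which represents an index stored in memory.
Now that we are familiar with the classes needed to build an index, we can index the text files in a directory. Listing 1 gives the source code for doing so.
Listing 1. Indexing text files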
package TestLucene;

import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/**
 * This class demonstrates the process of creating an index with Lucene
 * for text files
 */
public class TxtFileIndexer {
    public static void main(String[] args) throws Exception {
        // indexDir is the directory that hosts Lucene's index files
        File indexDir = new File("D:\\luceneIndex");
        // dataDir is the directory that hosts the text files to be indexed
        File dataDir = new File("D:\\luceneData");
        Analyzer luceneAnalyzer = new StandardAnalyzer();
        File[] dataFiles = dataDir.listFiles();
        IndexWriter indexWriter = new IndexWriter(indexDir, luceneAnalyzer, true);
        long startTime = new Date().getTime();
        for (int i = 0; i < dataFiles.length; i++) {
            if (dataFiles[i].isFile() && dataFiles[i].getName().endsWith(".txt")) {
                System.out.println("Indexing file " + dataFiles[i].getCanonicalPath());
                Document document = new Document();
                Reader txtReader = new FileReader(dataFiles[i]);
                document.add(Field.Text("path", dataFiles[i].getCanonicalPath()));
                document.add(Field.Text("contents", txtReader));
                indexWriter.addDocument(document);
            }
        }
        indexWriter.optimize();
        indexWriter.close();
        long endTime = new Date().getTime();
        System.out.println("It takes " + (endTime - startTime)
            + " milliseconds to create index for the files in directory "
            + dataDir.getPath());
    }
}

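In Listing 1, note that the constructor of IndexWriter takes three arguments. The first specifies where the index is to be stored; it can be a File object, an FSDirectory object or a RAMDirectory object. The second is an implementation of the Analyzer class, that is, the tokenizer used to tokenize the document content for this index. The third is a boolean: true means a new index is created, false means the existing index is opened and modified. The program then walks through all text files in the directory and creates a Document object for each one. It puts two attributes of each text file, its path and its content, into two Field objects, adds these Field objects to the Document, and finally adds the Document to the index with the addDocument method of IndexWriter. This completes the creation of the index. Next we turn to searching the index we have built.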

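Searching documents
Searching with Lucene is just as convenient as building an index. In the previous section we built an index for the text files in a directory; now we search that index for documents containing a given keyword or phrase. Lucene provides several basic classes for this: IndexSearcher, Term, Query, TermQuery and Hits. Their functions are as follows.
Query
An abstract class with several implementations, such as TermQuery, BooleanQuery and PrefixQuery. Its purpose is to wrap the query string entered by the user into a Query that Lucene understands.
Term
Term is the basic unit of search. A Term object consists of two String fields. It can be created with a statement such as: Term term = new Term("fieldName", "queryWord"); The first argument names the Field of the document to search, and the second is the keyword to look for.
TermQuery
TermQuery is a subclass of the abstract class Query and the most basic query type Lucene supports. A TermQuery is created with: TermQuery termQuery = new TermQuery(new Term("fieldName", "queryWord")); Its constructor takes a single argument, a Term object.
IndexSearcher
IndexSearcher is used to search an existing index. It opens the index in read-only mode, so several IndexSearcher instances can operate on the same index.
Hits
Hits holds the results of a search.
With these classes introduced, we can search the index built earlier. Listing 2 gives the code needed.
Listing 2. Searching an existing index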
package TestLucene;

import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

/**
 * This class is used to demonstrate the
 * process of searching on an existing
 * Lucene index
 */
public class TxtFileSearcher {
    public static void main(String[] args) throws Exception {
        String queryStr = "lucene";
        // This is the directory that hosts the Lucene index
        File indexDir = new File("D:\\luceneIndex");
        // Check that the index exists before trying to open it
        if (!indexDir.exists()) {
            System.out.println("The Lucene index does not exist");
            return;
        }
        FSDirectory directory = FSDirectory.getDirectory(indexDir, false);
        IndexSearcher searcher = new IndexSearcher(directory);
        Term term = new Term("contents", queryStr.toLowerCase());
        TermQuery luceneQuery = new TermQuery(term);
        Hits hits = searcher.search(luceneQuery);
        for (int i = 0; i < hits.length(); i++) {
            Document document = hits.doc(i);
            System.out.println("File: " + document.get("path"));
        }
    }
}

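In Listing 2, the constructor of IndexSearcher takes an object of type Directory. Directory is an abstract class that currently has two subclasses, FSDirectory and RAMDirectory. Our program passes in an FSDirectory object, which represents the location of an index stored on disk. Once the constructor has run, the IndexSearcher has opened the index in read-only mode. The program then builds a Term object that specifies that we want to search the document contents for the keyword "lucene". It uses this Term to build a TermQuery and passes the TermQuery to the search method of IndexSearcher; the results are stored in a Hits object. Finally, a loop prints the paths of all documents found. That completes our search application. As you can see, developing a search application with Lucene is quite simple.
Summary
This article first introduced some basic concepts of Lucene and then developed an application that demonstrates how to build an index with Lucene and search it. We hope it helps readers who are learning Lucene.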

Posted by 梓枫, 2008-09-17 12:10