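JAVA Coffee House

Welcome to rogerfan's blog. Drop by the JAVA Coffee House often for a cup of rich coffee, to discuss Java technology, exchange work experience, and share the fun that Java brings. Some articles on this site are reposts; if there is a copyright problem, please contact me.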

[Repost] Annotated Lucene: Section 3, How an Index Is Created

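    // Imports needed by this fragment (Lucene 2.x package layout)
    import java.io.File;
    import java.io.FileReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.DateTools;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Store the index on disk
    Directory directory = FSDirectory.getDirectory("/tmp/testindex");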
    // Use standard analyzer
    Analyzer analyzer = new StandardAnalyzer();
    // Create IndexWriter object
    IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
    iwriter.setMaxFieldLength(25000);
    // make a new, empty document
    Document doc = new Document();
    File f = new File("/tmp/test.txt");
    // Add the path of the file as a field named "path".  Use a field that is
    // indexed (i.e. searchable), but don't tokenize the field into words.
    doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.TOKENIZED));
    // Add the last modified date of the file as a field named "modified".  Use
    // a field that is indexed (i.e. searchable), but don't tokenize the field
    // into words.
    doc.add(new Field("modified",
        DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
        Field.Store.YES, Field.Index.UN_TOKENIZED));
    // Add the contents of the file to a field named "contents".  Specify a Reader,
    // so that the text of the file is tokenized and indexed, but not stored.
    // Note that FileReader expects the file to be in the system's default encoding.
    // If that's not the case searching for special characters will fail.
    doc.add(new Field("contents", new FileReader(f)));
    iwriter.addDocument(doc);
    iwriter.optimize();
    iwriter.close();

The sections below describe in detail how each class works.

4.1 The Index-Creation Class IndexWriter

An IndexWriter object creates and maintains an index.

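The create argument of its constructor determines whether a new index is created or an existing index is opened. Note that you can open an index with create=true even while other readers are using it. The old readers keep searching the "point in time" snapshot they had opened and will not see the newly created index until they re-open. There is also a constructor without the create argument: it creates a new index if none exists at the provided path, and otherwise opens the existing one.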

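In either case, documents are added with addDocument() and removed with deleteDocuments(), and a document can be updated with updateDocument() (which simply performs a delete followed by an add). When you have finished adding, deleting, and updating documents, call close().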

These changes are buffered in memory and periodically flushed to the Directory (during the method calls above). A flush is triggered when, since the last flush, there are enough buffered deletes (see setMaxBufferedDeleteTerms(int)) or enough added documents (see setMaxBufferedDocs(int)), whichever is sooner. When a flush occurs, both the pending deletes and the pending added documents are flushed to the index. A flush may also trigger one or more segment merges.
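The "whichever is sooner" rule is easier to see in a toy model. The sketch below is plain Java, not Lucene code: the class BufferingWriterSketch and all of its members are hypothetical names, and the two constructor arguments only play the roles of setMaxBufferedDocs(int) and setMaxBufferedDeleteTerms(int). It assumes a flush fires as soon as either buffer reaches its limit.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Lucene): buffers adds and deletes, flushes when either limit is hit. */
class BufferingWriterSketch {
    private final int maxBufferedDocs;        // stands in for setMaxBufferedDocs(int)
    private final int maxBufferedDeleteTerms; // stands in for setMaxBufferedDeleteTerms(int)
    private final List<String> pendingAdds = new ArrayList<>();
    private final List<String> pendingDeletes = new ArrayList<>();
    private int flushCount = 0;

    BufferingWriterSketch(int maxBufferedDocs, int maxBufferedDeleteTerms) {
        this.maxBufferedDocs = maxBufferedDocs;
        this.maxBufferedDeleteTerms = maxBufferedDeleteTerms;
    }

    void addDocument(String doc) {
        pendingAdds.add(doc);
        flushIfNeeded();
    }

    void deleteDocuments(String term) {
        pendingDeletes.add(term);
        flushIfNeeded();
    }

    /** Whichever buffer fills first triggers the flush. */
    private void flushIfNeeded() {
        if (pendingAdds.size() >= maxBufferedDocs
                || pendingDeletes.size() >= maxBufferedDeleteTerms) {
            flush();
        }
    }

    /** A flush writes out pending adds and pending deletes together. */
    void flush() {
        pendingAdds.clear();
        pendingDeletes.clear();
        flushCount++;
    }

    int flushCount() { return flushCount; }
    int pendingAddCount() { return pendingAdds.size(); }
    int pendingDeleteCount() { return pendingDeletes.size(); }
}
```

With limits of 3 documents and 2 delete terms, two adds plus one delete leave everything buffered; the third add triggers a flush that also empties the pending delete.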

The optional autoCommit argument of the constructor controls the visibility of changes to IndexReader instances reading the same index. When it is false, changes are not visible until close() is called. Note that changes are still flushed to the Directory as new files, but they are not committed (no new segments_N file referencing the new files is written) until close() is called. If something goes terribly wrong before close() is called (for example, the JVM crashes), the index will reflect none of the changes made (it remains in its starting state). You can also call abort(), which closes the writer without committing any changes and removes all index files that were flushed but are now unreferenced. This mode is useful for preventing readers from refreshing at a bad time (for example, after you have finished all your deletes but before you have finished your adds). It can also be used to implement simple single-writer transactional semantics ("all or none").
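The difference between "flushed" and "committed" can be sketched with a small model. This is a simplification in plain Java, not Lucene's implementation: CommitOnCloseSketch and its methods are hypothetical, file names such as "_0" are made up, and the committed list merely stands in for what the latest segments_N file references.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Lucene) of autoCommit=false: flushed files stay invisible until close(). */
class CommitOnCloseSketch {
    // Files referenced by the last commit; this is all that readers can see.
    private final List<String> committed;
    // Files written by flushes but not yet referenced by any commit.
    private final List<String> flushed = new ArrayList<>();

    CommitOnCloseSketch(List<String> startingState) {
        this.committed = new ArrayList<>(startingState);
    }

    /** A flush writes a new file but does not publish it. */
    void flush(String segmentFile) {
        flushed.add(segmentFile);
    }

    /** Readers only ever see the last committed state. */
    List<String> visibleToReaders() {
        return new ArrayList<>(committed);
    }

    /** close() commits: everything flushed so far becomes visible at once. */
    void close() {
        committed.addAll(flushed);
        flushed.clear();
    }

    /** abort() discards the unreferenced files and leaves the index in its starting state. */
    void abort() {
        flushed.clear();
    }

    int unreferencedFileCount() { return flushed.size(); }
}
```

In this model a crash before close() behaves like abort() as far as readers are concerned: the committed list was never touched, which is the "all or none" property described above.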

When autoCommit is true, every flush is also a commit (IndexReader instances see each flush as a commit). This is the default, chosen to match the behavior before version 2.2. When running in this mode, be careful not to refresh your readers while an optimize or segment merges are taking place, because these operations can tie up substantial disk space.

When no more documents will be added to an index for a while and optimal search performance is desired, optimize() should be called before the index is closed.

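Opening an IndexWriter creates a lock file for the Directory in use. Trying to open another IndexWriter on the same Directory leads to a LockObtainFailedException. The same exception is thrown if an IndexReader on the same Directory is used to delete documents from the index.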
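The general idea of a lock file can be sketched with the JDK alone. The example below is not Lucene's lock implementation: DirectoryLockSketch is a hypothetical class, the lock file name is an assumption chosen for illustration, and an IllegalStateException stands in for LockObtainFailedException. It relies on File.createNewFile(), which creates the file atomically only if it does not already exist.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Toy model (not Lucene): one writer per directory, enforced by a lock file. */
class DirectoryLockSketch {
    private final File lockFile;

    private DirectoryLockSketch(File lockFile) {
        this.lockFile = lockFile;
    }

    /** Creates the lock file, or fails if another holder already created it. */
    static DirectoryLockSketch obtain(File indexDir) {
        File lockFile = new File(indexDir, "write.lock"); // name assumed for illustration
        try {
            // createNewFile() returns false when the file already exists.
            if (!lockFile.createNewFile()) {
                throw new IllegalStateException("Lock obtain failed: " + lockFile);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new DirectoryLockSketch(lockFile);
    }

    /** Releases the lock so that another writer can be opened. */
    void release() {
        lockFile.delete();
    }
}
```

A second obtain() on the same directory fails until the first holder calls release(), which mirrors why only one IndexWriter can be open on a Directory at a time.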

Expert: IndexWriter allows an optional IndexDeletionPolicy implementation to be specified. You can use it to control when prior commits are deleted from the index. The default policy is KeepOnlyLastCommitDeletionPolicy, which removes all prior commits as soon as a new commit is completed (this matches the behavior before version 2.2). Creating your own policy lets you explicitly keep previous "point in time" commits alive in the index for some time, so that readers can refresh to the new commit without having the old commit deleted out from under them. This is necessary on filesystems such as NFS that do not support the "delete on last close" semantics that Lucene's "point in time" search normally relies on.
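The policy idea can be illustrated with a simplified model. Lucene's actual IndexDeletionPolicy interface has a different shape, so everything below is hypothetical: the interface DeletionPolicySketch, both implementations, and the commit names are made up, and commits are represented as plain strings ordered oldest first.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Lucene's interface): decides which prior commits may be deleted. */
interface DeletionPolicySketch {
    /** Given all commits, oldest first, returns the commits that may be deleted. */
    List<String> commitsToDelete(List<String> commits);
}

/** Mirrors the default behavior described above: keep only the newest commit. */
class KeepOnlyLastSketch implements DeletionPolicySketch {
    public List<String> commitsToDelete(List<String> commits) {
        if (commits.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(commits.subList(0, commits.size() - 1));
    }
}

/** A custom policy: keep the newest n commits so that slow readers can still use them. */
class KeepLastNSketch implements DeletionPolicySketch {
    private final int n;

    KeepLastNSketch(int n) {
        this.n = n;
    }

    public List<String> commitsToDelete(List<String> commits) {
        int cut = Math.max(0, commits.size() - n);
        return new ArrayList<>(commits.subList(0, cut));
    }
}
```

Given three commits, the keep-only-last policy marks the two older ones for deletion, while a keep-last-2 policy marks only the oldest, leaving one prior commit available to readers that have not refreshed yet.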

Posted on 2010-06-21 09:58 by rogerfan. Category: [Open Source Technology]
