
2010-01-04 傳智播客: Lucene

Posted on 2010-01-04 21:26 長城

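We netizens have all seen plenty of search applications: Google, Baidu, forum search, site search, and so on. All of them are built on retrieval technology. Today's class focused on internal text search for web applications such as forums, using the Lucene framework. Our instructor was 湯陽光, young and accomplished!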

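Anyone who works in tech gets curious about every new technology they come across. Leaving Google and Baidu aside (their specialized search technology is very strong), even the search box inside a forum made me curious. I had always assumed forum search was just a fuzzy lookup against the database, but today's class taught me how forum search really works.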

1. Information Retrieval and Full-Text Search

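Information retrieval means finding, within a collection of information, the items relevant to what the user needs. Besides text, the information being searched can include images, audio, video, and other multimedia. We only care about text here. Comparing the user's query against every word of the full text, without considering whether the query matches the text semantically, is called full-text search. Among information retrieval tools, full-text search is the most general and the most practical. One example is using Baidu to pull the pages related to "傳智播客" out of a huge pile of web pages.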

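Here is a quick look at the retrieval flow:

[Figure: retrieval flow diagram (image not available; the only surviving label is "request")]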

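As the flow above shows, the index store is critical. Put simply, a full-text search system saves data from the network into an index store in a particular format. When a user sends a query, it is really a query against the index store. The full-text search engine is responsible for handling user requests and for updating the index store.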

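How did we look things up in Chinese and English dictionaries as kids? Certainly not by turning the pages one at a time. We relied on the dictionary's index. Lucene searches in exactly the same way. Its index store is laid out like this: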

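Lookup directory		Data
Index	Keywords	Index	Documents…
1	C, assembly	1	Assembly is higher-level than machine language… C is higher-level than assembly
2	C++	2	C++ is higher-level than C..
3	Java, J2E	3	Java is application-level, the most outstanding language
4	WEB	4	WEB is the real computer…
5	你好	5	Hi, 你好啊!
…	…	…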

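The table above is only a simplified picture of how Lucene's index store is laid out. Using the classes Lucene provides, we can add, update, query, and delete entries in the index store.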

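Keywords are produced by an analyzer (tokenizer). In general, each language has its own analyzer. The better the analyzer, the more closely the search results match what the user is looking for.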
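
To make the "lookup directory plus data" idea concrete, here is a minimal sketch in plain Java. It is only a toy: the class name InvertedIndexSketch, its methods, and the lower-case-and-split tokenizer are my own assumptions for illustration, not Lucene's API and not its real storage format.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy inverted index (hypothetical; not Lucene's API or on-disk format).
class InvertedIndexSketch {
    // "Data" side of the table: document id -> original text.
    private final Map<Integer, String> documents = new HashMap<Integer, String>();
    // "Lookup directory" side: keyword -> ids of the documents that contain it.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<String, TreeSet<Integer>>();

    // Stand-in for an analyzer: lower-case, then split on anything that is not a-z or 0-9.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Store the document and register each of its keywords in the lookup directory.
    void add(int id, String text) {
        documents.put(id, text);
        for (String token : tokenize(text)) {
            TreeSet<Integer> ids = postings.get(token);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(token, ids);
            }
            ids.add(id);
        }
    }

    // Look the keyword up in the directory instead of scanning every document.
    List<Integer> search(String keyword) {
        TreeSet<Integer> ids = postings.get(keyword.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<Integer>() : new ArrayList<Integer>(ids);
    }

    // Fetch the stored text for a hit.
    String document(int id) {
        return documents.get(id);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Assembly is higher level than machine language");
        index.add(2, "C is higher level than assembly");
        index.add(3, "Java is an application level language");
        System.out.println(index.search("assembly")); // prints [1, 2]
    }
}
```

A query for a keyword goes straight to the map entry for that keyword, which is the dictionary-index trick described above. A real analyzer does much more than this split, so treat the tokenizer here as a placeholder.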

2. Lucene Operations

Let's write add, update, query, and delete operations that work on an Article object against the Lucene index store:

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

import cn.itcast.cc.lucene.helloword.Article;
import cn.itcast.cc.lucene.helloword.utils.ArticleDocUtils;

public class IndexDao {

    // Index directory
    private String indexPath = "./index";

    // Analyzer (tokenizer)
    private Analyzer analyzer = new StandardAnalyzer();

    /**
     * Saves a record.
     *
     * @param art the article to add to the index store
     */
    public void save(Article art) {
        // Lucene's index-writing class
        IndexWriter indexWriter = null;
        try {
            indexWriter = new IndexWriter(this.indexPath, this.analyzer,
                    MaxFieldLength.LIMITED);
            // Add the document to the index store
            indexWriter.addDocument(ArticleDocUtils.Article2Doc(art));
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release the IndexWriter; it must always be closed after use
            if (indexWriter != null) {
                try {
                    indexWriter.close();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }

    /**
     * Updates a record.
     *
     * @param art the article whose index entry should be replaced
     */
    public void update(Article art) {
        IndexWriter indexWriter = null;
        try {
            indexWriter = new IndexWriter(this.indexPath, this.analyzer,
                    MaxFieldLength.LIMITED);
            Term term = new Term("id", art.getId() + "");
            // Update: deletes the documents matching the id term, then adds the new one
            indexWriter.updateDocument(term, ArticleDocUtils.Article2Doc(art));
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release the IndexWriter
            if (indexWriter != null) {
                try {
                    indexWriter.close();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }

    /**
     * Deletes a record.
     *
     * @param id the id of the article to remove from the index store
     */
    public void delete(int id) {
        IndexWriter indexWriter = null;
        try {
            indexWriter = new IndexWriter(this.indexPath, this.analyzer,
                    MaxFieldLength.LIMITED);
            Term term = new Term("id", id + "");
            // Delete every document whose id field matches the term
            indexWriter.deleteDocuments(term);
        } catch (Exception e1) {
            e1.printStackTrace();
        } finally {
            // Release the IndexWriter
            if (indexWriter != null) {
                try {
                    indexWriter.close();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }

    /**
     * Searches records, with paging.
     *
     * @param queryString the query expression
     * @param startIndex  index of the first hit to return (0-based)
     * @param recordCount maximum number of hits to return
     * @return the Lucene Document objects for the requested page
     */
    public List search(String queryString, int startIndex, int recordCount) {
        List result = new ArrayList();
        IndexSearcher indexSearcher = null;
        try {
            indexSearcher = new IndexSearcher(this.indexPath);
            String[] fields = new String[] { "title", "content" };
            // Parse the query string against both fields
            QueryParser queryParser = new MultiFieldQueryParser(fields,
                    this.analyzer);
            // Build the query object
            Query query = queryParser.parse(queryString);
            // Ask for enough top hits to cover the requested page
            int findTotalRecord = startIndex + recordCount;
            // Run the search
            TopDocs topDocs = indexSearcher.search(query, null, findTotalRecord);
            // Pick out the hits that belong to the requested page
            int endIndex = Math.min(startIndex + recordCount, topDocs.totalHits);
            for (int i = startIndex; i < endIndex; i++) {
                result.add(indexSearcher.doc(topDocs.scoreDocs[i].doc));
            }
        } catch (Exception e1) {
            e1.printStackTrace();
        } finally {
            // Release the IndexSearcher
            if (indexSearcher != null) {
                try {
                    indexSearcher.close();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
        return result;
    }

}
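
The paging arithmetic in search() is easy to get wrong, so here is the same logic pulled out into a tiny standalone sketch. PagingSketch and page() are hypothetical names of mine, and a plain list stands in for Lucene's ranked hits. It assumes startIndex and recordCount are non-negative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: the paging arithmetic of search(), applied to a plain list of ranked hits.
class PagingSketch {
    static <T> List<T> page(List<T> rankedHits, int startIndex, int recordCount) {
        // Never read past the number of hits that actually exist.
        int endIndex = Math.min(startIndex + recordCount, rankedHits.size());
        List<T> result = new ArrayList<T>();
        for (int i = startIndex; i < endIndex; i++) {
            result.add(rankedHits.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(page(hits, 0, 2)); // first page: [a, b]
        System.out.println(page(hits, 4, 2)); // last, partial page: [e]
        System.out.println(page(hits, 6, 2)); // past the end: []
    }
}
```

The point of Math.min is that the last page may be shorter than recordCount, and a start index beyond the last hit simply yields an empty page instead of an exception.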

The "ArticleDocUtils" helper used above looks like this:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;

import cn.itcast.cc.lucene.helloword.Article;

public class ArticleDocUtils {

    public static Document Article2Doc(Article art) {
        // The Document object stored in the Lucene index store
        Document doc = new Document();
        // Everything added to a Document is a Field; the field name is the key used later when searching.
        // Store: whether the original value is kept in the index so it can be read back from a hit
        // Index: how the field is indexed (NOT_ANALYZED = one exact term, ANALYZED = run through the analyzer)
        doc.add(new Field("id", art.getId() + "", Store.YES, Index.NOT_ANALYZED));
        doc.add(new Field("title", art.getTitle(), Store.YES, Index.ANALYZED));
        doc.add(new Field("content", art.getContent(), Store.YES, Index.ANALYZED));
        return doc;
    }

    public static Article Doc2Article(Document doc) {
        Article art = new Article();
        art.setId(Integer.parseInt(doc.getField("id").stringValue()));
        art.setTitle(doc.getField("title").stringValue());
        art.setContent(doc.getField("content").stringValue());
        return art;
    }

}
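
The Article class itself is not shown in this post. Judging only from how IndexDao and ArticleDocUtils call it, it is probably a plain bean along these lines. This is a guess: the field types are inferred from the calls (getId() + "" and Integer.parseInt for setId), and in the real project the class would be public and live in the cn.itcast.cc.lucene.helloword package.

```java
// Assumed shape of the Article bean; the original class is not shown in the post.
class Article {
    private int id;         // assumed int, because Doc2Article calls Integer.parseInt
    private String title;
    private String content;

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }
}
```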

Required JARs: lucene-analyzers-2.4.0.jar, lucene-core-2.4.0.jar, lucene-highlighter-2.4.0.jar.

That covers the basic Lucene operations. Tomorrow's class will cover analyzers, highlighting, sorting, filtering…

湯老師 assigned practice exercises today, so I'm off to do them~~ Lucene is simple and convenient~~
