Building a simple Java search tool with Lucene
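I am new to Lucene and have only just started with search engines. Having learned a little, I wanted to build a small tool that searches Java source files by "word". For example, entering "String" finds the Java source files that use that class.
The idea goes back to when I first learned Java. A Java basics textbook came with a CD containing a program the author had written, which let beginners look up the examples in which a given class appears. I did not pay much attention at the time and only thought the author's code was very long. Now I want to write a small program like that myself.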
Development tools and runtime environment:
The Lucene 2.0 package and JDK 1.5, running on Windows XP.
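Analysis and design:
Apart from the necessary Lucene operations, the whole program is basic IO. Every Java source file under a directory and its subdirectories has to be indexed, so recursion is needed, and files that are not Java sources have to be filtered out. With that in mind I designed the following 5 classes.
Main classes: the indexing class (IndexJavaFiles) and the searching class (SearchJavaFiles)
Exception classes: the indexing exception (IndexException) and the searching exception (SearchException)
Plus a file filter factory class (FileFilterFactory).
The exception classes are not strictly necessary; they exist to wrap IO exceptions, file exceptions and Lucene's exceptions. The file filter factory is not there to look clever either. I just did not want too much code piled up in one place, so the file filter lives in a class of its own. The complete code with comments follows.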
IndexJavaFiles.java
/**
 * index the java source files
 */
package powerwind;

import java.io.*;
import java.util.Date;

import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;

/**
 * @author Powerwind
 * @version 1.0
 */
public class IndexJavaFiles {

    /**
     * Default constructor
     */
    public IndexJavaFiles() {
    }

    /**
     * Private recursive method called by index, which makes sure that the
     * file it passes in is a directory rather than a plain file.
     *
     * @param writer
     * @param file
     * @param filter
     * @throws IndexException
     */
    private void indexDirectory(IndexWriter writer, File file, FileFilter filter)
            throws IndexException {
        if (file.isDirectory()) {
            // List the directory's files and subdirectories through the filter
            File[] files = file.listFiles(filter);
            // listFiles returns null if the directory cannot be read
            if (files != null) {
                for (int i = 0; i < files.length; i++) {
                    indexDirectory(writer, files[i], filter);
                }
            }
        } else {
            try {
                // This file has already passed the filter
                writer.addDocument(parseFile(file));
                System.out.println("Added file: " + file);
            } catch (IOException ioe) {
                throw new IndexException(ioe.getMessage());
            }
        }
    }

    /**
     * Indexes the argument directly if it is a file; if it is a directory,
     * hands it to indexDirectory for recursion.
     *
     * @param writer
     * @param file
     * @param filter
     * @throws IndexException
     */
    public void index(IndexWriter writer, File file, FileFilter filter)
            throws IndexException {
        // Make sure the file exists and is readable
        if (file.exists() && file.canRead()) {
            if (file.isDirectory()) {
                indexDirectory(writer, file, filter);
            } else if (filter.accept(file)) {
                try {
                    writer.addDocument(parseFile(file));
                    System.out.println("Added file: " + file);
                } catch (IOException ioe) {
                    throw new IndexException(ioe.getMessage());
                }
            } else {
                System.out.println("Invalid file or directory; nothing was indexed");
            }
        }
    }

    /**
     * Turns a File into a Document.
     *
     * @param file
     */
    private Document parseFile(File file) throws IndexException {
        Document doc = new Document();
        // Store the path as a single untokenized term so it can be shown in results
        doc.add(new Field("path", file.getAbsolutePath(), Field.Store.YES,
                Field.Index.UN_TOKENIZED));
        try {
            // The contents are tokenized and indexed from the reader, but not stored
            doc.add(new Field("contents", new FileReader(file)));
        } catch (FileNotFoundException fnfe) {
            throw new IndexException(fnfe.getMessage());
        }
        return doc;
    }
}
index(IndexWriter writer, File file, FileFilter filter) is the public entry point. When it is given a directory, it calls the private method indexDirectory(IndexWriter writer, File file, FileFilter filter) to index the files.
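The recursion-plus-filter pattern used by indexDirectory can be tried out without Lucene. What follows is only a minimal sketch under my own assumptions: the class JavaFileWalker and its collect method are made-up names, the filter is simplified to a name check, and matching files are gathered into a list instead of being added to an index.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the tool: walks a directory tree and
// collects every file the filter accepts.
class JavaFileWalker {

    // Accepts directories (so the walk can descend) and *.java files.
    static final FileFilter JAVA_FILES = new FileFilter() {
        public boolean accept(File f) {
            return f.isDirectory() || f.getName().endsWith(".java");
        }
    };

    static List<File> collect(File root, FileFilter filter) {
        List<File> found = new ArrayList<File>();
        walk(root, filter, found);
        return found;
    }

    private static void walk(File file, FileFilter filter, List<File> found) {
        if (file.isDirectory()) {
            // listFiles returns null when the directory cannot be read
            File[] children = file.listFiles(filter);
            if (children != null) {
                for (File child : children) {
                    walk(child, filter, found);
                }
            }
        } else if (filter.accept(file)) {
            // A plain file: this is where the real tool would call addDocument
            found.add(file);
        }
    }
}
```

Calling JavaFileWalker.collect(new File("."), JavaFileWalker.JAVA_FILES) returns the .java files under the current directory. Because the filter lets directories through, the recursion reaches files at any depth.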
Below is the IndexException class.
IndexException.java
package powerwind;

public class IndexException extends Exception {

    public IndexException(String message) {
        super("Throw IndexException while indexing files: " + message);
    }
}
Below is the FileFilterFactory class, which returns a specific file filter (FileFilter).
FileFilterFactory.java
package powerwind;

import java.io.*;

public class FileFilterFactory {

    /**
     * A single shared filter, implemented as an anonymous inner class
     */
    private static FileFilter filter = new FileFilter() {
        public boolean accept(File file) {
            long len;
            // Accept directories, and non-empty .java files smaller than 1 MB
            return file.isDirectory()
                    || (file.getName().endsWith(".java")
                        && ((len = file.length()) > 0) && len < 1024 * 1024);
        }
    };

    public static FileFilter getFilter() {
        return filter;
    }
}
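To make the boundaries of this filter explicit, the same rule can be restated as a pure function on a file's name, length and type. This is only an illustrative sketch: JavaSourceRule, accepts and MAX_BYTES are hypothetical names that do not appear in the tool.

```java
// Hypothetical helper that restates the filter's rule as a pure function,
// so the boundary cases can be checked without touching the file system.
class JavaSourceRule {

    // Upper bound on file size in bytes (1 MB), exclusive
    static final long MAX_BYTES = 1024 * 1024;

    static boolean accepts(String name, long length, boolean isDirectory) {
        if (isDirectory) {
            // Directories always pass so that the recursion can descend into them
            return true;
        }
        return name.endsWith(".java") && length > 0 && length < MAX_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(accepts("Foo.java", 120, false));   // true
        System.out.println(accepts("Empty.java", 0, false));   // false: empty file
        System.out.println(accepts("notes.txt", 120, false));  // false: wrong extension
        System.out.println(accepts("src", 0, true));           // true: directory
    }
}
```

Note that a file of exactly 1024 * 1024 bytes is rejected, because the comparison is strict.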
main method (of IndexJavaFiles)
    /**
     * main method
     */
    public static void main(String[] args) throws Exception {
        IndexJavaFiles ijf = new IndexJavaFiles();
        Date start = new Date();
        try {
            // Note: the IndexWriterFactory class is not listed in this article
            IndexWriter writer = IndexWriterFactory.newInstance().createWriter("./index", true);
            System.out.println("Indexing ...");
            // Index the current directory and everything below it
            ijf.index(writer, new File("."), FileFilterFactory.getFilter());
            System.out.println("Optimizing...");
            writer.optimize();
            writer.close();
            Date end = new Date();
            System.out.println(end.getTime() - start.getTime() + " total milliseconds");
        } catch (IOException e) {
            System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
        }
    }
SearchJavaFiles.java
package powerwind;

import java.io.*;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.search.*;

public class SearchJavaFiles {

    private IndexSearcher searcher;

    private QueryParser parser;

    /**
     * @param searcher
     */
    public SearchJavaFiles(IndexSearcher searcher) {
        this.searcher = searcher;
    }

    /**
     * @param field
     * @param analyzer
     */
    public void setParser(String field, Analyzer analyzer) {
        setParser(new QueryParser(field, analyzer));
    }

    /**
     * @param parser
     */
    public void setParser(QueryParser parser) {
        this.parser = parser;
    }

    /**
     * @param query
     * @return Hits
     * @throws SearchException
     */
    public Hits search(Query query) throws SearchException {
        try {
            return searcher.search(query);
        } catch (IOException ioe) {
            throw new SearchException(ioe.getMessage());
        }
    }

    /**
     * @param queryString
     * @return Hits
     * @throws SearchException
     */
    public Hits search(String queryString) throws SearchException {
        if (parser == null)
            throw new SearchException("parser is null!");
        try {
            return searcher.search(parser.parse(queryString));
        } catch (IOException ioe) {
            throw new SearchException(ioe.getMessage());
        } catch (ParseException pe) {
            throw new SearchException(pe.getMessage());
        }
    }

    /**
     * Prints the results in hits from start (inclusive) to end (exclusive).
     *
     * @param hits
     * @param start
     * @param end
     * @throws SearchException
     */
    public static Hits display(Hits hits, int start, int end) throws SearchException {
        try {
            while (start < end) {
                Document doc = hits.doc(start);
                String path = doc.get("path");
                if (path != null) {
                    System.out.println((start + 1) + "- " + path);
                } else {
                    System.out.println((start + 1) + "- " + "No such path");
                }
                start++;
            }
        } catch (IOException ioe) {
            throw new SearchException(ioe.getMessage());
        }
        return hits;
    }
main method
    /**
     * @param args
     */
    public static void main(String[] args) throws Exception {
        String field = "contents";
        String index = "./index";
        final int rows_per_page = 2;
        final char NO = 'n';
        SearchJavaFiles sjf = new SearchJavaFiles(new IndexSearcher(IndexReader.open(index)));
        sjf.setParser(field, new StandardAnalyzer());
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
        while (true) {
            System.out.println("Query: ");
            String line = in.readLine();
            // End of input or a query shorter than two characters ends the session
            if (line == null || line.length() < 2) {
                System.out.println("exit query");
                break;
            }
            Hits hits = sjf.search(line);
            System.out.println("searching for " + line + " Result is ");
            int len = hits.length();
            int i = 0;
            if (len > 0)
                while (true) {
                    if (i + rows_per_page >= len) {
                        // Last page: print whatever is left
                        SearchJavaFiles.display(hits, i, len);
                        break;
                    } else {
                        // Arguments are evaluated left to right, so the old i is the
                        // start and the incremented i is the end of this page
                        SearchJavaFiles.display(hits, i, i += rows_per_page);
                        System.out.println("more y/n?");
                        line = in.readLine();
                        // Also stop at end of input instead of failing on null
                        if (line == null || line.length() < 1 || line.charAt(0) == NO)
                            break;
                    }
                }
            else
                System.out.println("not found");
        }
    }
}
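The paging loop in main prints rows_per_page results at a time, with a shorter final page when the total is not a multiple of the page size. The arithmetic behind it can be isolated as below. This is a sketch with hypothetical names (PageRanges, ranges) and no Lucene dependency; it computes the [start, end) pairs that display would be called with if the user always answered "y".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes the half-open [start, end) ranges of each page.
class PageRanges {

    static List<int[]> ranges(int total, int rowsPerPage) {
        if (rowsPerPage <= 0) {
            throw new IllegalArgumentException("rowsPerPage must be positive");
        }
        List<int[]> pages = new ArrayList<int[]>();
        for (int start = 0; start < total; start += rowsPerPage) {
            // The last page ends at total, so it may hold fewer rows
            int end = Math.min(start + rowsPerPage, total);
            pages.add(new int[] { start, end });
        }
        return pages;
    }
}
```

For 5 hits and 2 rows per page this yields [0, 2), [2, 4) and [4, 5), which matches the calls the loop makes.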
SearchException.java
package powerwind;

public class SearchException extends Exception {

    public SearchException(String message) {
        super("Throw SearchException while searching files: " + message);
    }
}
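Both exception classes keep only the message of the exception they wrap, so the original stack trace is lost. One possible variant, shown here purely as a sketch, also passes the original exception along as the cause. WrappedIndexException and CauseDemo are hypothetical names, and failingStep merely simulates a missing file.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

// Hypothetical variant of the wrapper exception that keeps the original
// exception as its cause.
class WrappedIndexException extends Exception {

    WrappedIndexException(String message, Throwable cause) {
        super("Throw IndexException while indexing files: " + message, cause);
    }
}

class CauseDemo {

    // Simulates an indexing step that fails with an IO problem
    static void failingStep() throws WrappedIndexException {
        try {
            throw new FileNotFoundException("Missing.java");
        } catch (IOException ioe) {
            // Wrap the exception, but keep it reachable through getCause()
            throw new WrappedIndexException(ioe.getMessage(), ioe);
        }
    }
}
```

A caller that catches WrappedIndexException can then use getCause() or printStackTrace() to see where the underlying IO error came from.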
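Ideas for improvement:
1. File formats:
Handle Zip and Jar files and index the Java source files inside them.
Index class files through reflection.
2. Input and output:
Besides console input and output, optionally read the query keywords from a file and write the query results to a file.
3. User interface:
A graphical interface in which double-clicking a result opens the corresponding file.
4. Performance:
Use caching and multiple threads when indexing files.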
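As a starting point for the first idea, the Java sources inside a Zip or Jar file can be enumerated with java.util.zip alone. The following is only a rough sketch of that single step: ZipJavaSources and listJavaEntries are hypothetical names, and nothing is handed to Lucene yet.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: lists the *.java entries of a Zip or Jar archive.
class ZipJavaSources {

    static List<String> listJavaEntries(File archive) throws IOException {
        List<String> names = new ArrayList<String>();
        ZipFile zip = new ZipFile(archive);
        try {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                // Skip directory entries and anything that is not a Java source
                if (!entry.isDirectory() && entry.getName().endsWith(".java")) {
                    names.add(entry.getName());
                }
            }
        } finally {
            zip.close();
        }
        return names;
    }
}
```

To actually index an entry, its contents could presumably be read through zip.getInputStream(entry) wrapped in an InputStreamReader and passed to the same Field constructor that parseFile uses for "contents". In that case the archive would have to stay open until the document has been added.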