夢幻e家人

          java咖啡

August 7, 2007

Lucene Keyword Highlighting

Lucene's org.apache.lucene.search.highlight package provides utilities for highlighting search keywords. When you search with Baidu or Google, the terms in each result summary that match your keywords are highlighted; both Baidu and Google render them in red.

With the highlighting utilities that Lucene provides, this feature is straightforward to implement.

Highlighting works as follows: given the user's search keywords, retrieve the matching documents, extract a summary text for each one, then, according to a configured highlight format, wrap that format around the terms in the summary that match (or are similar to) the keywords. When the page is rendered, the keyword-related text in the summary appears highlighted.

The org.apache.lucene.search.highlight.SimpleHTMLFormatter class constructs a highlight format. The simplest construction looks like this:

          SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");

          構(gòu)造方法聲明為public SimpleHTMLFormatter(String preTag, String postTag),因為這種高亮格式是依賴于網(wǎng)頁文件的,HTML文件中是以標(biāo)記(tag)來標(biāo)識的,即存在一個preTag和一個postTag。

          上面構(gòu)造的高亮格式是摘要中出現(xiàn)的關(guān)鍵字使用紅色來顯示,區(qū)分其它文本。

          通過構(gòu)造好的高亮格式對象,來構(gòu)造一個org.apache.lucene.search.highlight.Highlighter實例,然后根據(jù)對檢索結(jié)果得到的Field的文本內(nèi)容(這里是指摘要文本)進行切分,找到與檢索關(guān)鍵字相同或相似的詞條,將高亮格式加入到摘要文本中,返回一個新的、帶有格式的摘要文本,在網(wǎng)頁上就可以呈現(xiàn)高亮顯示。

The following simple example walks through the highlighting process.

The test class is shown below:

          package org.shirdrn.lucene.learn.highlight;

          import java.io.IOException;
          import java.io.StringReader;

          import net.teamhot.lucene.ThesaurusAnalyzer;

          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.document.Field;
          import org.apache.lucene.index.CorruptIndexException;
          import org.apache.lucene.index.IndexWriter;
          import org.apache.lucene.queryParser.ParseException;
          import org.apache.lucene.queryParser.QueryParser;
          import org.apache.lucene.search.Hits;
          import org.apache.lucene.search.IndexSearcher;
          import org.apache.lucene.search.Query;
          import org.apache.lucene.search.highlight.Highlighter;
          import org.apache.lucene.search.highlight.QueryScorer;
          import org.apache.lucene.search.highlight.SimpleFragmenter;
          import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

          public class MyHighLighter {

          private String indexPath = "F:\\index";
          private Analyzer analyzer;
          private IndexSearcher searcher;

          public MyHighLighter(){
             analyzer = new ThesaurusAnalyzer();
          }

          public void createIndex() throws IOException {   // builds the index
             IndexWriter writer = new IndexWriter(indexPath,analyzer,true);
             Document docA = new Document();
             String fileTextA = "因為火燒云總是燃燒著消失在太陽沖下地平線的時刻,然后便是寧靜的自然的天籟,沒有誰會在這樣的時光的鏡片里傷感自語,因為燦爛給人以安靜的舒適感。";
             Field fieldA = new Field("contents", fileTextA, Field.Store.YES,Field.Index.TOKENIZED);
             docA.add(fieldA);
            
             Document docB = new Document();
             String fileTextB = "因為帶有以傷痕為代價的美麗風景總是讓人不由地惴惴不安,緊接著襲面而來的抑或是病痛抑或是災難,沒有誰會能夠安逸著恬然,因為模糊讓人撕心裂肺地想?yún)群啊?";
             Field fieldB = new Field("contents", fileTextB, Field.Store.YES,Field.Index.TOKENIZED);
             docB.add(fieldB);
            
             Document docC = new Document();
             String fileTextC = "我喜歡上了一個人孤獨地行游,在夢與海洋的交接地帶熾烈燃燒著。"+
             "因為,一條孤獨的魚喜歡上了火焰的顏色,真是荒唐地不合邏輯。";
             Field fieldC = new Field("contents", fileTextC, Field.Store.YES,Field.Index.TOKENIZED);
             docC.add(fieldC);
            
             writer.addDocument(docA);
             writer.addDocument(docB);
             writer.addDocument(docC);
             writer.optimize();
             writer.close();
          }

          public void search(String fieldName,String keyword) throws CorruptIndexException, IOException, ParseException{   // runs the search and highlights matches
             searcher = new IndexSearcher(indexPath);
             QueryParser queryParse = new QueryParser(fieldName, analyzer);     // build a QueryParser to parse the user's search keywords
             Query query = queryParse.parse(keyword);
             Hits hits = searcher.search(query);
             for(int i=0;i<hits.length();i++){
              Document doc = hits.doc(i);
              String text = doc.get(fieldName);
              SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");
              Highlighter highlighter = new Highlighter(simpleHTMLFormatter, new QueryScorer(query));
              if (text != null) {   // guard before calling text.length()
                  highlighter.setTextFragmenter(new SimpleFragmenter(text.length()));
                  TokenStream tokenStream = analyzer.tokenStream(fieldName, new StringReader(text));
                  String highLightText = highlighter.getBestFragment(tokenStream, text);
                  System.out.println("★ Highlighted result " + (i + 1) + ":");
                  System.out.println(highLightText);
              }
             }
             searcher.close();
          }


          public static void main(String[] args) {    // test entry point
             MyHighLighter mhl = new MyHighLighter();
             try {
              mhl.createIndex();
              mhl.search("contents", "因為");
             } catch (CorruptIndexException e) {
              e.printStackTrace();
             } catch (IOException e) {
              e.printStackTrace();
             } catch (ParseException e) {
              e.printStackTrace();
             }
          }

          }

Program notes:

1. The createIndex() method uses the ThesaurusAnalyzer analyzer to index the given texts. Each Document has a Field named contents. In a real application you would typically also add a Field named path holding the location of the matched file (a local path or a URL).

2. Search against the index that was just built. First parse the user's search keywords with a QueryParser; it must use the same analyzer as indexing did, otherwise the parsed Query (built from terms) is not guaranteed to retrieve a sensible result set.

3. Run the parsed Query; the results are stored in a Hits object. Iterate over it and extract each matching Document's content. The example treats the full content as the summary to highlight. A real application would have a summarization step here (or a database lookup that returns the summary for each matching file). Once a summary exists, the highlight format is applied to it.

4. To use only the first N characters of each result as the summary, set the desired summary length in highlighter.setTextFragmenter(new SimpleFragmenter(text.length())); this example displays the full text as the summary.
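The effect of a small fragment size can be sketched in plain Java. This is a hypothetical firstN helper illustrating the idea, not the Lucene API; with Lucene you would pass N to SimpleFragmenter instead:

```java
public class SummarySketch {
    // Truncate text to at most maxLen characters, appending an
    // ellipsis when something was cut off - the visible effect of
    // choosing a small fragment size for the summary.
    static String firstN(String text, int maxLen) {
        if (text.length() <= maxLen) {
            return text;
        }
        return text.substring(0, maxLen) + "...";
    }

    public static void main(String[] args) {
        System.out.println(firstN("a very long summary text", 6));  // prints "a very..."
    }
}
```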

Running the program produces:

The thesaurus has not been initialized; initializing it now.
Thesaurus initialization finished in 3906 ms;
195574 terms were added.
★ Highlighted result 1:
          <font color='red'>因為</font>火燒云總是燃燒著消失在太陽沖下地平線的時刻,然后便是寧靜的自然的天籟,沒有誰會在這樣的時光的鏡片里傷感自語,<font color='red'>因為</font>燦爛給人以安靜的舒適感。
★ Highlighted result 2:
<font color='red'>因為</font>帶有以傷痕為代價的美麗風景總是讓人不由地惴惴不安,緊接著襲面而來的抑或是病痛抑或是災難,沒有誰會能夠安逸著恬然,<font color='red'>因為</font>模糊讓人撕心裂肺地想?yún)群啊?
★ Highlighted result 3:
          我喜歡上了一個人孤獨地行游,在夢與海洋的交接地帶熾烈燃燒著。<font color='red'>因為</font>,一條孤獨的魚喜歡上了火焰的顏色,真是荒唐地不合邏輯。

          上面的檢索結(jié)果在HTML網(wǎng)頁中,就會高亮顯示關(guān)鍵字“因為”,顯示為紅色。

Posted 2008-08-06 11:24 by 軒轅 | Views: 207 | Comments: 0

Lucene Keyword Highlighting (servlet version)

          package searchfileexample;

          import javax.servlet.*;
          import javax.servlet.http.*;
          import java.io.*;
          import java.io.IOException;
          import java.io.StringReader;

          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.document.Field;
          import org.apache.lucene.index.CorruptIndexException;
          import org.apache.lucene.index.IndexWriter;
          import org.apache.lucene.queryParser.ParseException;
          import org.apache.lucene.queryParser.QueryParser;
          import org.apache.lucene.search.Hits;
          import org.apache.lucene.search.IndexSearcher;
          import org.apache.lucene.search.Query;
          import org.apache.lucene.search.highlight.Highlighter;
          import org.apache.lucene.search.highlight.QueryScorer;
          import org.apache.lucene.search.highlight.SimpleFragmenter;
          import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
          import org.apache.lucene.analysis.standard.StandardAnalyzer;


          public class MyHighLighterServlet extends HttpServlet {
            private static final String CONTENT_TYPE = "text/html; charset=GB18030";

            private String indexPath = "C:\\index";
            private Analyzer analyzer;
            private IndexSearcher searcher;

            //Initialize global variables
            public void init() throws ServletException {
              analyzer = new StandardAnalyzer();
            }
            public void createIndex() throws IOException {   // builds the index
                 IndexWriter writer = new IndexWriter(indexPath,analyzer,true);
                 Document docA = new Document();
                 String fileTextA = "因為火燒云總是燃燒著消失在太陽沖下地平線的時刻,然后便是寧靜的自然的天籟,沒有誰會在這樣的時光的鏡片里傷感自語,因為燦爛給人以安靜的舒適感。";
                 Field fieldA = new Field("contents", fileTextA, Field.Store.YES,Field.Index.TOKENIZED);
                 docA.add(fieldA);
           
                 Document docB = new Document();
                 String fileTextB = "因為帶有以傷痕為代價的美麗風景總是讓人不由地惴惴不安,緊接著襲面而來的抑或是病痛抑或是災難,沒有誰會能夠安逸著恬然,因為模糊讓人撕心裂肺地想?yún)群啊?";
                 Field fieldB = new Field("contents", fileTextB, Field.Store.YES,Field.Index.TOKENIZED);
                 docB.add(fieldB);
           
                 Document docC = new Document();
                 String fileTextC = "我喜歡上了一個人孤獨地行游,在夢與海洋的交接地帶熾烈燃燒著。"+
                 "因為,一條孤獨的魚喜歡上了火焰的顏色,真是荒唐地不合邏輯,原因。";
                 Field fieldC = new Field("contents", fileTextC, Field.Store.YES,Field.Index.TOKENIZED);
                 docC.add(fieldC);
           
                 writer.addDocument(docA);
                 writer.addDocument(docB);
                 writer.addDocument(docC);
                 writer.optimize();
                 writer.close();
              }
           
              public void search(String fieldName,String keyword,PrintWriter out) throws CorruptIndexException, IOException, ParseException{   // runs the search and highlights matches
                 searcher = new IndexSearcher(indexPath);
                 QueryParser queryParse = new QueryParser(fieldName, analyzer);     // build a QueryParser to parse the user's search keywords
                 Query query = queryParse.parse(keyword);
                 Hits hits = searcher.search(query);
                 for(int i=0;i<hits.length();i++){
                  Document doc = hits.doc(i);
                  String text = doc.get(fieldName);
                  SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");
                  Highlighter highlighter = new Highlighter(simpleHTMLFormatter, new QueryScorer(query));
                  if (text != null) {   // guard before calling text.length()
                      highlighter.setTextFragmenter(new SimpleFragmenter(text.length()));
                      TokenStream tokenStream = analyzer.tokenStream(fieldName, new StringReader(text));
                      String highLightText = highlighter.getBestFragment(tokenStream, text);
                      System.out.println("★ Highlighted result " + (i + 1) + ":");
                      out.println(highLightText);
                  }
                 }
                 searcher.close();
              }

            //Process the HTTP Get request
            public void service(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
              response.setContentType(CONTENT_TYPE);
              PrintWriter out = response.getWriter();
              out.println("<html>");
              out.println("<head><title>MyHighLighterServlet</title></head>");
              out.println("<body bgcolor=\"#ffffff\">");

            
               try {
                createIndex();
                search("contents", "因為",out);
               } catch (CorruptIndexException e) {
                e.printStackTrace();
               } catch (IOException e) {
                e.printStackTrace();
               } catch (ParseException e) {
                e.printStackTrace();
               }

             
             
             
              out.println("</body></html>");
            }

            //Clean up resources
            public void destroy() {
            }
          }

Posted 2008-08-06 11:22 by 軒轅 | Views: 866 | Comments: 0

The meaning of _blank and _self

_blank -- open in a new window
_parent -- open in the parent frame
_self -- open in the current page (the default)
_top -- open in the topmost frame
_search -- open the search pane alongside

The name of a frame -- open in that named frame

Posted 2008-07-08 10:23 by 軒轅 | Views: 209 | Comments: 0

prototype.js development notes

Summary: Table of Contents 1. Programming Guide 1.1. What is Prototype? 1.2. Related articles 1.3. Utility methods 1.3.1. Using the $() method 1.3.2. Using the $F() method 1.3.3. Using the $A() method 1.3.4. Using the $H() method 1.3.5. Using the $R() method 1.3.6. Using Try.these()...  Read more

Posted 2008-06-05 15:56 by 軒轅 | Views: 173 | Comments: 0

Full-text search, version 2: handling TXT, Word, and Excel files

          package searchfileexample;

          /**
           * Reads Excel files
           */
          import java.io.*;
          import org.apache.poi.hssf.usermodel.HSSFWorkbook;
          import org.apache.poi.hssf.usermodel.HSSFSheet;
          import org.apache.poi.hssf.usermodel.HSSFCell;
          import org.apache.poi.hssf.usermodel.HSSFDateUtil;
          import java.util.Date;
          import org.apache.poi.hssf.usermodel.HSSFRow;

          public class ExcelReader {
            // buffered reader (for text files)
            private BufferedReader reader = null;

            // file type (extension)
            private String filetype;

            // binary input stream
            private InputStream is = null;

            // current sheet index
            private int currSheet;

            // current row position
            private int currPosition;

            // number of sheets
            private int numOfSheets;

            // the HSSFWorkbook
            HSSFWorkbook workbook = null;

            // delimiter inserted between cells
            private static String EXCEL_LINE_DELIMITER = " ";

            // maximum number of columns
            private static int MAX_EXCEL_COLUMNS = 64;

            public int rows = 0;
            public int getRows() {
              return rows;
            }

            // 構(gòu)造函數(shù)創(chuàng)建一個ExcelReader

            public ExcelReader(String inputfile) throws IOException, Exception {
              // 判斷參數(shù)是否為空或沒有意義
              if (inputfile == null || inputfile.trim().equals("")) {
                throw new IOException("no input file specified");
              }
              // 取得文件名的后綴名賦值給filetype
              this.filetype = inputfile.substring(inputfile.lastIndexOf(".") + 1);
              // 設(shè)置開始行為0
              currPosition = 0;
              // 設(shè)置當(dāng)前位置為0
              currSheet = 0;
              // 創(chuàng)建文件輸入流
              is = new FileInputStream(inputfile);
              // 判斷文件格式
              if (filetype.equalsIgnoreCase("txt")) {
                // 如果是txt則直接創(chuàng)建BufferedReader讀取
                reader = new BufferedReader(new InputStreamReader(is));
              }
              else if (filetype.equalsIgnoreCase("xls")) {
                // 如果是Excel文件則創(chuàng)建HSSFWorkbook讀取
                workbook = new HSSFWorkbook(is);
                // 設(shè)置Sheet數(shù)
                numOfSheets = workbook.getNumberOfSheets();
              }
              else {
                throw new Exception("File Type Not Supported");
              }
            }

            // readLine returns the next line of the file
            public String readLine() throws IOException {
              // text files: delegate to the reader
              if (filetype.equalsIgnoreCase("txt")) {
                String str = reader.readLine();
                // skip blank lines (guard against null at end of file)
                while (str != null && str.trim().equals("")) {
                  str = reader.readLine();
                }
                return str;
              }
              // Excel files: read through the POI API
              else if (filetype.equalsIgnoreCase("xls")) {
                // get the sheet indicated by currSheet
                HSSFSheet sheet = workbook.getSheetAt(currSheet);
                rows = sheet.getLastRowNum();
                // past the end of the current sheet?
                if (currPosition > sheet.getLastRowNum()) {
                  // reset the row position
                  currPosition = 0;
                  // any sheets left?
                  while (currSheet != numOfSheets - 1) {
                    // look at the next sheet
                    sheet = workbook.getSheetAt(currSheet + 1);
                    // already at the end of that sheet as well?
                    if (currPosition == sheet.getLastRowNum()) {
                      // advance to the next sheet
                      currSheet++;
                      continue;
                    }
                    else {
                      // remember the current row
                      int row = currPosition;
                      currPosition++;
                      // return the current row's data
                      return getLine(sheet, row);
                    }
                  }
                  return null;
                }
                // remember the current row
                int row = currPosition;
                currPosition++;
                // return the current row's data
                return getLine(sheet, row);
              }
              return null;
            }

            // getLine returns one row of a sheet as a delimited string
            private String getLine(HSSFSheet sheet, int row) {
              // fetch the row
              HSSFRow rowline = sheet.getRow(row);
              // buffer for the line
              StringBuffer buffer = new StringBuffer();
              // number of populated columns in this row
              int filledColumns = rowline.getLastCellNum();
              HSSFCell cell = null;
              // iterate over the columns
              for (int i = 0; i < filledColumns; i++) {
                // fetch the current cell
                cell = rowline.getCell( (short) i);
                String cellvalue = null;
                if (cell != null) {
                  // dispatch on the cell type
                  switch (cell.getCellType()) {
                    // numeric cells
                    case HSSFCell.CELL_TYPE_NUMERIC: {
                      // is the cell a date?
                      if (HSSFDateUtil.isCellDateFormatted(cell)) {
                        // format the date as a locale-specific string
                        cellvalue = cell.getDateCellValue().toLocaleString();
                      }
                      // a plain number
                      else {
                        // take the numeric value
                        Integer num = new Integer( (int) cell
                                                  .getNumericCellValue());
                        cellvalue = String.valueOf(num);
                      }
                      break;
                    }
                    // string cells
                    case HSSFCell.CELL_TYPE_STRING:
                      // take the string value, escaping single quotes
                      cellvalue = cell.getStringCellValue().replaceAll("'", "''");
                      break;
                    // any other cell type
                    default:
                      cellvalue = " ";
                  }
                }
                else {
                  cellvalue = "";
                }
                // append the delimiter after each field
                buffer.append(cellvalue).append(EXCEL_LINE_DELIMITER);
              }
              // return the row as a string
              return buffer.toString();
            }

            // close releases the underlying streams
            public void close() {
              // close the InputStream if it is open
              if (is != null) {
                try {
                  is.close();
                }
                catch (IOException e) {
                  is = null;
                }
              }
              // close the BufferedReader if it is open
              if (reader != null) {
                try {
                  reader.close();
                }
                catch (IOException e) {
                  reader = null;
                }
              }
            }

            public static void main(String[] args) {
              try {
                ExcelReader er = new ExcelReader("d:\\xp.xls");
                String line = er.readLine();
                while (line != null) {
                  System.out.println(line);
                  line = er.readLine();
                }
                er.close();
              }
              catch (Exception e) {
                e.printStackTrace();
              }
            }

          }
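The switch in getLine converts each cell to text by type. The same dispatch can be sketched on plain Java values (a hypothetical cellToString helper, with Date/Number/String standing in for the HSSFCell type codes): dates format to a string, numbers truncate to an integer string, and strings have single quotes doubled.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class CellValueSketch {
    // Convert a cell-like value to text, mirroring the HSSFCell switch:
    // Date -> formatted string, Number -> integer string,
    // String -> single quotes escaped, anything else -> a space.
    static String cellToString(Object value) {
        if (value == null) {
            return "";                     // empty cell
        }
        if (value instanceof Date) {
            return new SimpleDateFormat("yyyy-MM-dd").format((Date) value);
        }
        if (value instanceof Number) {
            return String.valueOf(((Number) value).intValue());
        }
        if (value instanceof String) {
            return ((String) value).replaceAll("'", "''");
        }
        return " ";                        // default branch
    }

    public static void main(String[] args) {
        System.out.println(cellToString(3.7));     // prints "3"
        System.out.println(cellToString("it's"));  // prints "it''s"
    }
}
```

Note that, as in the original, casting a numeric cell to int silently drops the fractional part; keep the double value if that matters.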

          package searchfileexample;

          import javax.servlet.*;
          import javax.servlet.http.*;
          import java.io.*;
          import java.util.*;

          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.index.IndexWriter;

          import java.io.File;
          import java.io.FileNotFoundException;
          import java.io.IOException;
          import java.util.Date;
          import org.apache.lucene.demo.FileDocument;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.document.Field;
          import java.io.FileReader;
          import org.apache.lucene.index.*;
          import java.text.DateFormat;
          import org.apache.poi.hdf.extractor.WordDocument;
          import java.io.InputStream;
          import java.io.StringWriter;
          import java.io.PrintWriter;
          import java.io.FileInputStream;
          import java.io.*;
          import org.textmining.text.extraction.WordExtractor;
          import org.apache.poi.hssf.usermodel.HSSFWorkbook;

          /**
           * Builds an index for every file under a directory.
           * <p>Copyright: Copyright (c) 2007</p>
           * @author not attributable
           * @version 1.0
           * Depending on the file type, the index files could be created in
           * different folders, keeping the index data organized by category.
           */

          public class IndexFilesServlet
              extends HttpServlet {
            static final File INDEX_DIR = new File("index");

            //Initialize global variables
            public void init() throws ServletException {
            }

            //Process the HTTP Get request
            public void service(HttpServletRequest request, HttpServletResponse response) throws
                ServletException, IOException {
              final File docDir = new File("a"); // folder containing the files to index
              if (!docDir.exists() || !docDir.canRead()) {
                System.out.println("Document directory '" + docDir.getAbsolutePath() +
                                   "' does not exist or is not readable, please check the path");
                System.exit(1);
              }

              Date start = new Date();
              try {
                IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true); // true = overwrite the existing index, false = keep it
                System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
                indexDocs(writer, docDir);
                System.out.println("Optimizing...");
                writer.optimize();
                writer.close();

                Date end = new Date();
                System.out.println(end.getTime() - start.getTime() +
                                   " total milliseconds");

              }
              catch (IOException e) {
                System.out.println(" caught a " + e.getClass() +
                                   "\n with message: " + e.getMessage());
              }

            }

            //Clean up resources
            public void destroy() {
            }

            public void indexDocs(IndexWriter writer, File file) throws IOException {
              // do not try to index files that cannot be read
              int index = 0;
              String filehouzui = "";   // the file extension ("houzui" = suffix)
              index = file.getName().indexOf(".");
              //strFileName = strFileName.substring(0, index) +DateUtil.getCurrDateTime() + "." + strFileName.substring(index + 1);
              filehouzui = file.getName().substring(index + 1);

              if (file.canRead()) {
                if (file.isDirectory()) {
                  String[] files = file.list();
                  // an IO error could occur
                  if (files != null) {
                    for (int i = 0; i < files.length; i++) {
                      indexDocs(writer, new File(file, files[i]));
                    }
                  }
                }
                else {
                  System.out.println("adding " + file);
                  try {
                    if (filehouzui.equals("doc")) {
                      writer.addDocument(getWordDocument(file, new FileInputStream(file)));
                    }
                    else if (filehouzui.equals("txt")) {
                      writer.addDocument(getTxtDocument(file, new FileInputStream(file)));
                    }
                    else if (filehouzui.equals("xls")) {
                      writer.addDocument(getExcelDocument(file, new FileInputStream(file)));
                    }
                    //writer.addDocument(parseFile(file));

                    //writer.addDocument(FileDocument.Document(file)); // the path field stores the file's relative path
                  }
                  // at least on windows, some temporary files raise this exception with an "access denied" message
                  // checking if the file can be read doesn't help
                  catch (Exception fnfe) {
                    ;
                  }
                }
              }
            }

            /**
             * @param file
             *
             * Converts a File into a Document
             */
            public Document parseFile(File file) throws Exception {
              Document doc = new Document();
              doc.add(new Field("path", file.getAbsolutePath(), Field.Store.YES,
                                Field.Index.UN_TOKENIZED)); // the file's absolute path
              try {
                doc.add(new Field("contents", new FileReader(file))); // index the file contents
                doc.add(new Field("title", file.getName(), Field.Store.YES,
                                  Field.Index.UN_TOKENIZED));
                //index the last-modified time
                doc.add(new Field("modified",
                                  String.valueOf(DateFormat.
                                                 getDateTimeInstance().format(new
                    Date(file.lastModified()))), Field.Store.YES,
                                  Field.Index.UN_TOKENIZED));
                //doc.removeField("title");
              }
              catch (Exception e) {
                e.printStackTrace();
              }
              return doc;
            }

           
            /**
             * @param file
             *
             * Reads a Word document with POI.
             * Not very reliable; parts of the document's text are missed.
             */
            public Document getDocument(File file, FileInputStream is) throws Exception {
              String bodyText = null;
              try {
                WordDocument wd = new WordDocument(is);
                StringWriter docTextWriter = new StringWriter();
                wd.writeAllText(new PrintWriter(docTextWriter));
                bodyText = docTextWriter.toString();
                docTextWriter.close();
                //   bodyText   =   new   WordExtractor().extractText(is);
                System.out.println("word content====" + bodyText);
              }
              catch (Exception e) {
                ;
              }
              if ( (bodyText != null)) {
                Document doc = new Document();
                doc.add(new Field("path", file.getAbsolutePath(), Field.Store.YES,
                                  Field.Index.UN_TOKENIZED)); // the file's absolute path
                doc.add(new Field("contents", bodyText, Field.Store.YES,
                                  Field.Index.TOKENIZED));
                return doc;
              }
              return null;
            }

            //Document   doc   =   getDocument(new   FileInputStream(new   File(file)));
            /**
             * @param file
             *
             * Reads a Word document with tm-extractors-0.4.jar.
             * Works well.
             */
            public Document getWordDocument(File file, FileInputStream is) throws
                Exception {
              String bodyText = null;
              try {
                WordExtractor extractor = new WordExtractor();
                System.out.println("Word document");
                bodyText = extractor.extractText(is);
                if ( (bodyText != null)) {
                  Document doc = new Document();
                  doc.add(new Field("path", file.getAbsolutePath(), Field.Store.YES,
                                    Field.Index.UN_TOKENIZED)); // the file's absolute path
                  doc.add(new Field("contents", bodyText, Field.Store.YES,
                                    Field.Index.TOKENIZED));
                  System.out.println("word content====" + bodyText);
                  return doc;
                }
              }
              catch (Exception e) {
                ;
              }
              return null;
            }

            /**
             * @param file
             *
             * Reads a TXT document
             */
            public Document getTxtDocument(File file, FileInputStream is) throws
                Exception {
              try {
                Reader textReader = new FileReader(file);
                Document doc = new Document();
                doc.add(new Field("path", file.getAbsolutePath(), Field.Store.YES,
                                  Field.Index.UN_TOKENIZED)); // the file's absolute path
                doc.add(new Field("contents", textReader));
                return doc;
              }
              catch (Exception e) {
                ;
              }
              return null;
            }

            /**
             * Reads an Excel file with POI.
             * @param file File
             * @param is FileInputStream
             * @throws Exception
             * @return Document
             */
            public Document getExcelDocument(File file, FileInputStream is) throws
                Exception {
              String bodyText = "";
              try {
                System.out.println("reading Excel file");
                ExcelReader er = new ExcelReader(file.getAbsolutePath());
                bodyText = er.readLine();
                int rows = 0;
                rows = er.getRows();
                for (int i = 0; i < rows; i++) {
                  bodyText = bodyText + er.readLine();
                  System.out.println("bodyText===" + bodyText);
                }
                Document doc = new Document();
                doc.add(new Field("path", file.getAbsolutePath(), Field.Store.YES,
                                  Field.Index.UN_TOKENIZED)); // the file's absolute path
                doc.add(new Field("contents", bodyText, Field.Store.YES,
                                  Field.Index.TOKENIZED));
                System.out.println("excel content====" + bodyText);
                return doc;
              }
              catch (Exception e) {
                System.out.println(e);
              }
              return null;
            }
          }


           

          package searchfileexample;

          import javax.servlet.*;
          import javax.servlet.http.*;
          import java.io.*;
          import java.util.*;

          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.index.FilterIndexReader;
          import org.apache.lucene.index.IndexReader;
          import org.apache.lucene.queryParser.QueryParser;
          import org.apache.lucene.search.Hits;
          import org.apache.lucene.search.IndexSearcher;
          import org.apache.lucene.search.Query;
          import org.apache.lucene.search.Searcher;

          import java.io.BufferedReader;
          import java.io.FileReader;
          import java.io.IOException;
          import java.io.InputStreamReader;
          import java.util.Date;
          import org.apache.lucene.queryParser.*;

          public class SearchFileServlet
              extends HttpServlet {
            private static final String CONTENT_TYPE = "text/html; charset=GBK";

            //Initialize global variables
            public void init() throws ServletException {
            }

            /** Use the norms from one field for all fields.  Norms are read into memory,
             * using a byte of memory per document per searched field.  This can cause
             * searches of large collections with many fields to run out of
             * memory.  If all of the fields contain only a single token, the norms
             * are all identical, and a single norm vector may be shared. */
            private static class OneNormsReader
                extends FilterIndexReader {
              private String field;

              public OneNormsReader(IndexReader in, String field) {
                super(in);
                this.field = field;
              }

              public byte[] norms(String field) throws IOException {
                return in.norms(this.field);
              }
            }

            //Process the HTTP Get request
            public void service(HttpServletRequest request, HttpServletResponse response) throws
                ServletException, IOException {
              response.setContentType(CONTENT_TYPE);
              PrintWriter out = response.getWriter();

              String[] args = {
                  "a", "b"};
              String usage =
                  "Usage: java org.apache.lucene.demo.SearchFiles [-index dir] [-field f] [-repeat n] [-queries file] [-raw] [-norms field]";
              if (args.length > 0 && ("-h".equals(args[0]) || "-help".equals(args[0]))) {
                System.out.println(usage);
                System.exit(0);
              }

              String index = "index"; // name of the folder holding the generated index files; do not change
              String field = "contents"; // the field name must not be modified
              String queries = null; // a file containing the keywords to search for
              queries = "D:/lfy_programe/全文檢索/SearchFileExample/aa.txt";
              System.out.println("-----------------------" + request.getContextPath());
              int repeat = 1;
              boolean raw = false;
              String normsField = null;

              for (int i = 0; i < args.length; i++) {
                if ("-index".equals(args[i])) {
                  index = args[i + 1];
                  i++;
                }
                else if ("-field".equals(args[i])) {
                  field = args[i + 1];
                  i++;
                }
                else if ("-queries".equals(args[i])) {
                  queries = args[i + 1];
                  i++;
                }
                else if ("-repeat".equals(args[i])) {
                  repeat = Integer.parseInt(args[i + 1]);
                  i++;
                }
                else if ("-raw".equals(args[i])) {
                  raw = true;
                }
                else if ("-norms".equals(args[i])) {
                  normsField = args[i + 1];
                  i++;
                }
              }

              IndexReader reader = IndexReader.open(index);

              if (normsField != null) {
                reader = new OneNormsReader(reader, normsField);

              }
              Searcher searcher = new IndexSearcher(reader); // opens the index files
              Analyzer analyzer = new StandardAnalyzer(); // analyzer

              BufferedReader in = null;
              if (queries != null) {
                in = new BufferedReader(new FileReader(queries));
              }
              else {
                in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
              }
              QueryParser parser = new QueryParser(field, analyzer);

              out.println("<html>");
              out.println("<head><title>SearchFileServlet</title></head>");
              out.println("<body bgcolor=\"#ffffff\">");

              while (true) {
                if (queries == null) { // prompt the user
                  System.out.println("Enter query: ");

                }
                String line = in.readLine(); // the query string
                System.out.println("query string===" + line);

                if (line == null) {
                  break;
                }

                line = line.trim();
                if (line.length() == 0) {
                  break;
                }

                Query query = null;
                try {
                  query = parser.parse(line);
                }
                catch (ParseException ex) {
                  ex.printStackTrace();
                  continue; // skip unparsable lines instead of hitting an NPE below
                }
                System.out.println("Searching for: " + query.toString(field)); // each keyword

                Hits hits = searcher.search(query);

                if (repeat > 0) { // repeat & time as benchmark
                  Date start = new Date();
                  for (int i = 0; i < repeat; i++) {
                    hits = searcher.search(query);
                  }
                  Date end = new Date();
                  System.out.println("Time: " + (end.getTime() - start.getTime()) + "ms");
                }
                out.println("<p>Found " + hits.length() + " document(s) containing [" +
                            query.toString(field) + "]</p>");

                System.out.println("Found " + hits.length() + " document(s) containing [" +
                                   query.toString(field) + "]");

                final int HITS_PER_PAGE = 10; // maximum number of hits shown per page
                int currentNum = 5; // current record count (unused)

                for (int start = 0; start < hits.length(); start += HITS_PER_PAGE) {
                  //start = start + currentNum;
                  int end = Math.min(hits.length(), start + HITS_PER_PAGE);

                  for (int i = start; i < end; i++) {

                    //if (raw) {                              // output raw format
                    System.out.println("doc=" + hits.id(i) + " score=" + hits.score(i)); // score measures the relevance of the hit
                    //continue;
                    //}

                    Document doc = hits.doc(i);
                    String path = doc.get("path");

                    if (path != null) {
                      System.out.println( (i + 1) + ". " + path);
                      out.println("<p>" + (i + 1) + ". " + path + "</p>");
                      String title = doc.get("title");
                      System.out.println("   modified: " + doc.get("modified"));
                      if (title != null) {
                        System.out.println("   Title: " + doc.get("title"));
                      }
                    }
                    else {
                      System.out.println( (i + 1) + ". " + "No path for this document");
                    }
                  }

                  if (queries != null) { // non-interactive
                    break;
                  }

                  if (hits.length() > end) {
                    System.out.println("more (y/n) ? ");
                    line = in.readLine();
                    if (line.length() == 0 || line.charAt(0) == 'n') {
                      break;
                    }
                  }
                }
              }
              reader.close();

              out.println("</body></html>");
            }

          //Clean up resources
            public void destroy() {
            }
          }


           

           

          posted @ 2008-03-19 16:52 軒轅 views (924) | comments (0)

          Full-text search

          package searchfileexample;

          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.index.IndexWriter;

          import java.io.File;
          import java.io.FileNotFoundException;
          import java.io.IOException;
          import java.util.Date;
          import org.apache.lucene.demo.FileDocument;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.document.Field;
          import java.io.FileReader;
          import org.apache.lucene.index.*;
          import java.text.DateFormat;
          import org.apache.poi.hdf.extractor.WordDocument;
          import java.io.InputStream;
          import java.io.StringWriter;
          import java.io.PrintWriter;
          import java.io.FileInputStream;
          import java.io.*;
          import org.textmining.text.extraction.WordExtractor;

          /**
           * Build an index for every file under a directory.
           * <p>Copyright: Copyright (c) 2007</p>
           * @author not attributable
           * @version 1.0
           * Depending on the file type, the index files can be created in
           * different folders, so index data can be stored by category.
           */

          /** Index all text files under a directory. */
          public class IndexFiles {

            private IndexFiles() {}

            static final File INDEX_DIR = new File("index");

            /** Index all text files under a directory. */
            public static void main(String[] args) {
              String usage = "java org.apache.lucene.demo.IndexFiles <root_directory>";
              //String[] arg = {"a","b"};
              //System.out.println(arg[0]);
              /*
                   if (args.length == 0) {
                System.err.println("Usage: " + usage);
                System.exit(1);
                   }*/
              /*
                  if (INDEX_DIR.exists()) {
                    System.out.println("Cannot save index to '" +INDEX_DIR+ "' directory, please delete it first");
                    System.exit(1);
                  }*/

              final File docDir = new File("a"); // folder containing the files to index
              if (!docDir.exists() || !docDir.canRead()) {
                System.out.println("Document directory '" + docDir.getAbsolutePath() +
                                   "' does not exist or is not readable, please check the path");
                System.exit(1);
              }

              Date start = new Date();
              try {
                IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true); // true - overwrite any existing index, false - keep it
                System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
                indexDocs(writer, docDir);
                System.out.println("Optimizing...");
                writer.optimize();
                writer.close();

                Date end = new Date();
                System.out.println(end.getTime() - start.getTime() +
                                   " total milliseconds");

              }
              catch (IOException e) {
                System.out.println(" caught a " + e.getClass() +
                                   "\n with message: " + e.getMessage());
              }
            }

            static void indexDocs(IndexWriter writer, File file) throws IOException {
              // do not try to index files that cannot be read
              if (file.canRead()) {
                if (file.isDirectory()) {
                  String[] files = file.list();
                  // an IO error could occur
                  if (files != null) {
                    for (int i = 0; i < files.length; i++) {
                      indexDocs(writer, new File(file, files[i]));
                    }
                  }
                }
                else {
                  System.out.println("adding " + file);
                  try {

                    writer.addDocument(getDocument2(file, new FileInputStream(file)));
                    //writer.addDocument(parseFile(file));

                    //writer.addDocument(FileDocument.Document(file)); // "path" would store the file's relative path
                  }
                  // at least on windows, some temporary files raise this exception with an "access denied" message
                  // checking if the file can be read doesn't help
                  catch (Exception fnfe) {
                    ;
                  }
                }
              }
            }

            /**
             * Turn a File into a Lucene Document.
             * @param file the file to convert
             */
            static Document parseFile(File file) throws Exception {
              Document doc = new Document();
              doc.add(new Field("path", file.getAbsolutePath(), Field.Store.YES,
                                Field.Index.UN_TOKENIZED)); // store the file's absolute path
              try {
                doc.add(new Field("contents", new FileReader(file))); // index the file contents
                doc.add(new Field("title", file.getName(), Field.Store.YES,
                                  Field.Index.UN_TOKENIZED));
                // index the last-modified time
                doc.add(new Field("modified",
                                  String.valueOf(DateFormat.
                                                 getDateTimeInstance().format(new
                    Date(file.lastModified()))), Field.Store.YES,
                                  Field.Index.UN_TOKENIZED));
                //doc.removeField("title");
              }
              catch (Exception e) {
                e.printStackTrace();
              }
              return doc;
            }

            /**
             * Convert a Word document (commented-out sketch; it references undefined variables)

                   static String changeWord(File file) throws Exception {
              String re = "";
              try {
                WordDocument wd = new WordDocument(is);
                  StringWriter docTextWriter = new StringWriter();
                  wd.writeAllText(new PrintWriter(docTextWriter));
                  docTextWriter.close();
                  bodyText = docTextWriter.toString();

              } catch (Exception e) {
                  e.printStackTrace();
              }
              return re;
                   }*/
            /**
             * Read a Word document using POI.
             * @param file the Word file
             * @param is input stream over the file
             */
            static Document getDocument(File file, FileInputStream is) throws Exception {

              String bodyText = null;

              try {

                //BufferedReader wt = new BufferedReader(new InputStreamReader(is));
                //bodyText = wt.readLine();
                //System.out.println("word ===="+bodyText);

                WordDocument wd = new WordDocument(is);
                StringWriter docTextWriter = new StringWriter();
                wd.writeAllText(new PrintWriter(docTextWriter));
                bodyText = docTextWriter.toString();
                docTextWriter.close();
                //   bodyText   =   new   WordExtractor().extractText(is);
                System.out.println("word content====" + bodyText);
              }
              catch (Exception e) {
                e.printStackTrace(); // report the error instead of swallowing it
              }

              if (bodyText != null) {
                Document doc = new Document();
                doc.add(new Field("path", file.getAbsolutePath(), Field.Store.YES,
                                  Field.Index.UN_TOKENIZED)); // store the file's absolute path
                doc.add(new Field("contents", bodyText, Field.Store.YES,
                                  Field.Index.TOKENIZED));

                return doc;
              }
              return null;
            }

            //Document   doc   =   getDocument(new   FileInputStream(new   File(file)));
            /**
             * Read a Word document using tm-extractors-0.4.jar.
             * @param file the Word file
             * @param is input stream over the file
             */
            static Document getDocument2(File file, FileInputStream is) throws Exception {

              String bodyText = null;

              try {

                //FileInputStream in = new FileInputStream("D:/lfy_programe/全文檢索/SearchFileExample/a/aa.doc");
                //  FileInputStream in = new FileInputStream ("D:/szqxjzhbase/技術(shù)測試/新建 Microsoft Word 文檔.doc");
                WordExtractor extractor = new WordExtractor();
                System.out.println(is.available());

                bodyText = extractor.extractText(is);

          //    System.out.println("the result length is"+str.length());
                System.out.println("word content===="+bodyText);

              }
              catch (Exception e) {
                e.printStackTrace(); // report the error instead of swallowing it
              }

              if (bodyText != null) {
                Document doc = new Document();
                doc.add(new Field("path", file.getAbsolutePath(), Field.Store.YES,
                                  Field.Index.UN_TOKENIZED)); // store the file's absolute path
                doc.add(new Field("contents", bodyText, Field.Store.YES,
                                  Field.Index.TOKENIZED));

                return doc;
              }
              return null;
            }

          }


           

          package searchfileexample;


          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.index.FilterIndexReader;
          import org.apache.lucene.index.IndexReader;
          import org.apache.lucene.queryParser.QueryParser;
          import org.apache.lucene.search.Hits;
          import org.apache.lucene.search.IndexSearcher;
          import org.apache.lucene.search.Query;
          import org.apache.lucene.search.Searcher;


          import java.io.BufferedReader;
          import java.io.FileReader;
          import java.io.IOException;
          import java.io.InputStreamReader;
          import java.util.Date;
          import org.apache.lucene.analysis.SimpleAnalyzer;
          import org.apache.lucene.analysis.KeywordAnalyzer;
          import org.apache.lucene.analysis.WhitespaceAnalyzer;
          import org.apache.lucene.document.Fieldable;

          /** Simple command-line based search demo. */
          public class SearchFiles {

            /** Use the norms from one field for all fields.  Norms are read into memory,
             * using a byte of memory per document per searched field.  This can cause
             * searches of large collections with many fields to run out of
             * memory.  If all of the fields contain only a single token, the norms
             * are all identical, and a single norm vector may be shared. */
            private static class OneNormsReader extends FilterIndexReader {
              private String field;

              public OneNormsReader(IndexReader in, String field) {
                super(in);
                this.field = field;
              }

              public byte[] norms(String field) throws IOException {
                return in.norms(this.field);
              }
            }

            private SearchFiles() {}

            /** Simple command-line based search demo. */
            public static void main(String[] arg) throws Exception {
              String[] args = {"a","b"};
              String usage =
                "Usage: java org.apache.lucene.demo.SearchFiles [-index dir] [-field f] [-repeat n] [-queries file] [-raw] [-norms field]";
              if (args.length > 0 && ("-h".equals(args[0]) || "-help".equals(args[0]))) {
                System.out.println(usage);
                System.exit(0);
              }

              String index = "index";// name of the folder holding the generated index files; do not change
              String field = "contents";// the field name must not be modified
              String queries = null;// a file containing the keywords to search for
              queries = "D:/lfy_programe/全文檢索/SearchFileExample/aa.txt";

              int repeat = 1;
              boolean raw = false;
              String normsField = null;

              for (int i = 0; i < args.length; i++) {
                if ("-index".equals(args[i])) {
                  index = args[i+1];
                  i++;
                } else if ("-field".equals(args[i])) {
                  field = args[i+1];
                  i++;
                } else if ("-queries".equals(args[i])) {
                  queries = args[i+1];
                  i++;
                } else if ("-repeat".equals(args[i])) {
                  repeat = Integer.parseInt(args[i+1]);
                  i++;
                } else if ("-raw".equals(args[i])) {
                  raw = true;
                } else if ("-norms".equals(args[i])) {
                  normsField = args[i+1];
                  i++;
                }
              }

              IndexReader reader = IndexReader.open(index);

              if (normsField != null)
                reader = new OneNormsReader(reader, normsField);

              Searcher searcher = new IndexSearcher(reader);// opens the index files
              Analyzer analyzer = new StandardAnalyzer();// analyzer

              BufferedReader in = null;
              if (queries != null) {
                in = new BufferedReader(new FileReader(queries));
              } else {
                in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
              }
                QueryParser parser = new QueryParser(field, analyzer);
              while (true) {
                if (queries == null)                        // prompt the user
                  System.out.println("Enter query: ");

                String line = in.readLine();// the query string
                System.out.println("query string==="+line);

                if (line == null)
                  break;

                line = line.trim();
                if (line.length() == 0)
                  break;

                Query query = parser.parse(line);
                System.out.println("Searching for: " + query.toString(field));// each keyword

                Hits hits = searcher.search(query);

                if (repeat > 0) {                           // repeat & time as benchmark
                  Date start = new Date();
                  for (int i = 0; i < repeat; i++) {
                    hits = searcher.search(query);
                  }
                  Date end = new Date();
                  System.out.println("Time: "+(end.getTime()-start.getTime())+"ms");
                }

                System.out.println("Found " + hits.length() + " document(s) containing ["+query.toString(field)+"]");

                final int HITS_PER_PAGE = 10;// maximum number of hits shown per page
                int currentNum = 2;// current record count (unused)
                for (int start = 0; start < hits.length(); start += HITS_PER_PAGE) {
                  //start = start + currentNum;
                  int end = Math.min(hits.length(), start + HITS_PER_PAGE);
                  for (int i = start; i < end; i++) {

                    //if (raw) {                              // output raw format
                      System.out.println("doc="+hits.id(i)+" score="+hits.score(i));// score measures the relevance of the hit
                      //continue;
                    //}

                    Document doc = hits.doc(i);
                    String path = doc.get("path");


                    if (path != null) {
                      System.out.println((i+1) + ". " + path);
                      String title = doc.get("title");
                      System.out.println("   modified: " + doc.get("modified"));
                      if (title != null) {
                        System.out.println("   Title: " + doc.get("title"));
                      }
                    } else {
                      System.out.println((i+1) + ". " + "No path for this document");
                    }
                  }

                  if (queries != null)                      // non-interactive
                    break;

                  if (hits.length() > end) {
                    System.out.println("more (y/n) ? ");
                    line = in.readLine();
                    if (line.length() == 0 || line.charAt(0) == 'n')
                      break;
                  }
                }
              }
              reader.close();
            }
          }


           

          package searchfileexample;

          import javax.servlet.*;
          import javax.servlet.http.*;
          import java.io.*;
          import java.util.*;
          import org.textmining.text.extraction.WordExtractor;

          public class ReadWord extends HttpServlet {
            private static final String CONTENT_TYPE = "text/html; charset=GBK";

            //Initialize global variables
            public void init() throws ServletException {
            }

            //Process the HTTP Get request
            public void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
              response.setContentType(CONTENT_TYPE);
              FileInputStream in = new FileInputStream ("D:/lfy_programe/全文檢索/SearchFileExample/a/aa.doc");
                 //  FileInputStream in = new FileInputStream ("D:/szqxjzhbase/技術(shù)測試/新建 Microsoft Word 文檔.doc");
             WordExtractor extractor = new WordExtractor();
             System.out.println(in.available());
            String str = null;
            try {
              str = extractor.extractText(in);
            }
            catch (Exception ex) {
              ex.printStackTrace(); // report extraction failures instead of swallowing them
            }
          //    System.out.println("the result length is"+str.length());
             System.out.println(str);

            }

            //Clean up resources
            public void destroy() {
            }
          }

          1. Fuzzy queries on English terms
          Append the wildcard " * " to the keyword at query time.

          2. IndexFiles.java
          The Java class that builds the index.

          3. SearchFiles.java
          The Java class that runs searches.

          4. ReadWord.java
          Reads Word files using tm-extractors-0.4.jar.
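          The trailing-wildcard trick in point 1 works because a pattern like "luc*" is expanded into a prefix match over the indexed terms. A minimal plain-Java sketch of that expansion (the term dictionary here is hypothetical, and no Lucene dependency is used; Lucene's own wildcard/prefix queries do the real enumeration against the index):

          ```java
          import java.util.ArrayList;
          import java.util.Arrays;
          import java.util.List;

          public class WildcardDemo {

              // Expand a trailing-wildcard pattern like "luc*" against a term
              // dictionary, roughly the way a prefix query enumerates index terms.
              static List<String> expand(String pattern, List<String> terms) {
                  List<String> matches = new ArrayList<String>();
                  if (pattern.endsWith("*")) {
                      String prefix = pattern.substring(0, pattern.length() - 1);
                      for (String t : terms) {
                          if (t.startsWith(prefix)) {
                              matches.add(t);
                          }
                      }
                  } else if (terms.contains(pattern)) { // no wildcard: exact match only
                      matches.add(pattern);
                  }
                  return matches;
              }

              public static void main(String[] args) {
                  List<String> dictionary = Arrays.asList("lucene", "lucid", "search", "searcher");
                  System.out.println(expand("luc*", dictionary));   // [lucene, lucid]
                  System.out.println(expand("search", dictionary)); // [search]
              }
          }
          ```

          This is why "search" alone misses "searcher" while "search*" finds both: without the wildcard, only the exact term matches.
          
          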


           

           

          posted @ 2008-03-18 10:35 軒轅 views (266) | comments (0)

          Reading Word files with tm-extractors-0.4.jar

          package searchfileexample;

          import javax.servlet.*;
          import javax.servlet.http.*;
          import java.io.*;
          import java.util.*;
          import org.textmining.text.extraction.WordExtractor;

          public class ReadWord extends HttpServlet {
            private static final String CONTENT_TYPE = "text/html; charset=GBK";

            //Initialize global variables
            public void init() throws ServletException {
            }

            //Process the HTTP Get request
            public void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
              response.setContentType(CONTENT_TYPE);
              FileInputStream in = new FileInputStream ("D:/lfy_programe/全文檢索/SearchFileExample/a/aa.doc");
                 //  FileInputStream in = new FileInputStream ("D:/szqxjzhbase/技術(shù)測試/新建 Microsoft Word 文檔.doc");
             WordExtractor extractor = new WordExtractor();
             System.out.println(in.available());
            String str = null;
            try {
              str = extractor.extractText(in);
            }
            catch (Exception ex) {
              ex.printStackTrace(); // report extraction failures instead of swallowing them
            }
          //    System.out.println("the result length is"+str.length());
             System.out.println(str);

            }

            //Clean up resources
            public void destroy() {
            }
          }

          posted @ 2008-03-18 10:33 軒轅 views (5511) | comments (5)

          AJAX Upload with upload-progress monitoring

               Abstract: AJAX Upload with upload-progress monitoring, posted by cleverpig on 2007-01-08 11:12:14. Author: cleverpig. Source: Matrix. Comments: 83; clicks: 5,066; total vote score: 12; total votes: 4. Keywords: AJAX, upload, monitor ...  Read full text

          posted @ 2007-08-07 17:02 軒轅 views (2492) | comments (1)

          AJAX file upload

          http://www.matrix.org.cn/resource/article/2007-01-08/09db6d69-9ec6-11db-ab77-2bbe780ebfbf.html

          posted @ 2007-08-07 16:54 軒轅 views (223) | comments (0)
