willpower88
Starting to get the hang of Java...

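The sites used by the examples in the lucene2.0+heritrix book (http://mobile.pconline.com.cn/ and http://mobile.163.com/) have been redesigned, so the book's examples no longer run. I found another site, http://mobile.younet.com/, developed against it, and got the example working. If you are interested, try it soon: if this site changes as well, the extractor will have to be rewritten again, haha!
The Extractor code for the search example is given below for reference (it is not identical to the book's example). The complete code is in the attachment.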

package com.luceneheritrixbook.extractor.younet;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Date;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.ImageTag;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.NodeList;

import com.luceneheritrixbook.extractor.Extractor;
import com.luceneheritrixbook.util.StringUtils;

/**
* <p>Extracts phone details from mobile.younet.com pages found under the Heritrix mirror directory.</p>
* @author cnyqiao@hotmail.com
* @date   Feb 6, 2009
*/

public class ExtractYounetMoblie extends Extractor {

    @Override
    public void extract() {
        BufferedWriter bw = null;
        NodeFilter title_filter = new AndFilter(new TagNameFilter("div"), new HasAttributeFilter("class", "mo_tit"));
        NodeFilter attribute_filter = new AndFilter(new TagNameFilter("p"), new HasChildFilter(new AndFilter(new TagNameFilter("span"), new HasAttributeFilter("class", "gn_sp1 blue1"))));
        NodeFilter img_filter = new AndFilter(new TagNameFilter("span"), new HasChildFilter(new TagNameFilter("img")));

        // Extract the title information
        try {
            // The Parser returns every node that matches the filter;
            // iterate over the matches one by one
            NodeList nodeList=this.getParser().parse(title_filter);
            NodeIterator it = nodeList.elements();
            StringBuffer title = new StringBuffer();
            while (it.hasMoreNodes()) {
                Node node = (Node) it.nextNode();
                String[] names = node.toPlainTextString().split(" ");
                for(int i = 0; i < names.length; i++)
                    title.append(names[i]).append("-");
                title.append(new Date().getTime());
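                // Create the output file
                bw = new BufferedWriter(new FileWriter(new File(this.getOutputPath() + title + ".txt")));
                // Rebuild the full URL of the page being extracted from its path under "mirror"
                int startPos = this.getInuputFilePath().indexOf("mirror") + 6;
                String url_seg = this.getInuputFilePath().substring(startPos);
                url_seg = url_seg.replaceAll("\\\\", "/");
                String url = "http:/" + url_seg;
                // Write the full URL of the page being extracted
                bw.write(url + NEWLINE);
                bw.write(names[0] + NEWLINE);
                // A title without a space has no second part; write an empty line instead of failing
                bw.write((names.length > 1 ? names[1] : "") + NEWLINE);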

            }
            // Reset the Parser
            this.getParser().reset();
            Parser attNameParser = null;
            Parser attValueParser = null;
            //Parser parser=new Parser("http://www.sina.com.cn");
            NodeFilter attributeName_filter = new AndFilter(new TagNameFilter("span"), new HasAttributeFilter("class", "gn_sp1 blue1"));
            NodeFilter attributeValue_filter = new AndFilter(new TagNameFilter("span"), new HasAttributeFilter("class", "gn_sp2"));
            String attName = "";
            String attValue = "";
            // Iterate over the matching attribute nodes
            nodeList=this.getParser().parse(attribute_filter);
            it = nodeList.elements();
            while (it.hasMoreNodes()) {
                Node node = (Node) it.nextNode();
                attNameParser = new Parser();
                attNameParser.setEncoding("GB2312");
                attNameParser.setInputHTML(node.toHtml());
                NodeList attNameNodeList = attNameParser.parse(attributeName_filter);
                attName = attNameNodeList.elements().nextNode().toPlainTextString();

                attValueParser = new Parser();
                attValueParser.setEncoding("GB2312");
                attValueParser.setInputHTML(node.toHtml());
                NodeList attValueNodeList = attValueParser.parse(attributeValue_filter);
                attValue = attValueNodeList.elements().nextNode().toPlainTextString();
                bw.write(attName.trim() + attValue.trim());
                bw.newLine();
            }
            // Reset the Parser
            this.getParser().reset();
            String imgUrl = "";
            String fileType ="";
            // Iterate over the matching image nodes
            nodeList=this.getParser().parse(img_filter);
            it = nodeList.elements();
            while (it.hasMoreNodes()) {
                Node node = (Node) it.nextNode();

                ImageTag imgNode = (ImageTag)node.getChildren().elements().nextNode();
                imgUrl = imgNode.getAttribute("src");
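                // Take the extension from the trimmed URL so the index and the string agree
                String trimmedUrl = imgUrl.trim();
                fileType = trimmedUrl.substring(trimmedUrl.lastIndexOf(".") + 1);
                // Build the new image file name
                String new_iamge_file = StringUtils.encodePassword(imgUrl, HASH_ALGORITHM) + "." + fileType;
                //imgUrl = new HtmlPaserFilterTest().replace(new_iamge_file, "+", " ");
                // Create the new image from the copy in the mirror directory
                this.copyImage(imgUrl, new_iamge_file);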
                bw.write(SEPARATOR + NEWLINE);
                bw.write(new_iamge_file + NEWLINE);
            }


        } catch(Exception e) {
            e.printStackTrace();
        } finally {
            try{
                if (bw != null)
                    bw.close();
            }catch(IOException e){
                e.printStackTrace();
            }
        }

    }
}
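
One step in the extractor that is easy to miss is how the page URL is rebuilt from the path of the mirrored file. The sketch below isolates that step using only the JDK. The class name MirrorPathToUrl, the method toUrl, and the sample Windows path are made up for illustration, and it assumes the crawl output sits under a directory named "mirror" followed by the host name, as the extractor above expects.

```java
// Sketch only: MirrorPathToUrl and toUrl are hypothetical names, not part of the book's code.
final class MirrorPathToUrl {

    private static final String MIRROR_DIR = "mirror";

    /** Rebuilds the page URL from the path of a file saved under the mirror directory. */
    static String toUrl(String mirroredFilePath) {
        int pos = mirroredFilePath.indexOf(MIRROR_DIR);
        if (pos < 0) {
            throw new IllegalArgumentException("not under a mirror directory: " + mirroredFilePath);
        }
        // Everything after "mirror" is <separator>host<separator>path...
        String segment = mirroredFilePath.substring(pos + MIRROR_DIR.length());
        // Normalise Windows separators so the result is a valid URL path
        segment = segment.replace('\\', '/');
        // segment starts with "/", so "http:/" + "/host/..." yields "http://host/..."
        return "http:/" + segment;
    }

    public static void main(String[] args) {
        // Hypothetical location of a mirrored page on a Windows machine
        System.out.println(toUrl("C:\\jobs\\younet\\mirror\\mobile.younet.com\\files\\list_1.html"));
    }
}
```

Like the original, this takes the first occurrence of "mirror" in the path, so a job directory whose own name contains that word would need a stricter match.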

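Run the Heritrix example from the book and, keeping the book's default settings, crawl the following URIs (work out and compile the rest of the list yourself):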

http://mobile.younet.com/files/list_1.html
http://mobile.younet.com/files/list_2.html
http://mobile.younet.com/files/list_3.html
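
If the list pages keep following the numbering shown above, the seed list can be generated rather than typed by hand. This is a minimal sketch under that assumption; the class name SeedList and the page range are hypothetical, and how many list pages the site actually has must be checked on the site itself.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: SeedList is a hypothetical helper, not part of the book's code.
final class SeedList {

    /** Builds the list-page URIs for pages firstPage..lastPage (inclusive). */
    static List<String> listPages(int firstPage, int lastPage) {
        List<String> seeds = new ArrayList<>();
        for (int page = firstPage; page <= lastPage; page++) {
            // Assumes the list_N.html pattern continues past the three pages shown
            seeds.add("http://mobile.younet.com/files/list_" + page + ".html");
        }
        return seeds;
    }

    public static void main(String[] args) {
        // Print one URI per line, ready to paste into the crawl job's seed list
        for (String seed : listPages(1, 3)) {
            System.out.println(seed);
        }
    }
}
```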

posted on 2009-02-09 15:44 by 一凡 | Category: Search

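Comments:

# re: Supplement to the lucene2.0+heritrix example [not logged in] 2009-03-29 16:49 | lq

Very good, strongly support this!

# re: Supplement to the lucene2.0+heritrix example [not logged in] 2011-01-14 00:35 | aaaaa

Thanks.

# re: Supplement to the lucene2.0+heritrix example [not logged in] 2011-07-07 23:57 | 小龍

Ah.

# re: Supplement to the lucene2.0+heritrix example [not logged in] 2011-07-08 00:01 | 小龍

Finding this blog post felt like finding the person who saved my life. Please, you have to tell me: what package is import com.luceneheritrixbook.util.StringUtils; ?
I did buy the book 《開發自己的搜索引擎》 and typed in the code from it, but I cannot find this package anywhere. I suspect it is custom-written rather than part of the source code. Please get in touch and help me with this, thank you! It is driving me crazy!! My email is a542107840@qq.com, QQ: 542107840.

# re: Supplement to the lucene2.0+heritrix example 2011-07-08 11:47 | willpower88

I put the source code on my javaeye blog; you can download it there:
http://willpower88.iteye.com/admin/blogs/325722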
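
For anyone who cannot reach that download: judging only from how the extractor calls it, StringUtils.encodePassword(imgUrl, HASH_ALGORITHM) appears to turn a string into a hash that is safe to use as a file name. The sketch below is a hypothetical stand-in built on the JDK's MessageDigest, not the book's class. The names HashedFileName, hexDigest and imageFileName, the UTF-8 encoding, the hex output format, and the sample image URL are all assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch only: a stand-in for the unknown StringUtils.encodePassword helper.
final class HashedFileName {

    /** Hashes text with the given algorithm (e.g. "MD5") and returns lowercase hex. */
    static String hexDigest(String text, String algorithm) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("unknown hash algorithm: " + algorithm, e);
        }
    }

    /** Builds "<hash of URL>.<original extension>" as the extractor does for images. */
    static String imageFileName(String imgUrl, String algorithm) {
        String trimmed = imgUrl.trim();
        String extension = trimmed.substring(trimmed.lastIndexOf('.') + 1);
        return hexDigest(trimmed, algorithm) + "." + extension;
    }

    public static void main(String[] args) {
        // Hypothetical image URL, used only to show the shape of the result
        System.out.println(imageFileName("http://mobile.younet.com/images/phone.jpg", "MD5"));
    }
}
```

Files named by this stand-in will only match ones produced by the book's helper if both use the same algorithm and output encoding, so use one or the other consistently within a crawl.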
