MDA/MDD/TDD/DDD/DDDDDDD

HTML parsing and using NekoHTML

Posted on 2008-02-21 18:29 by leekiang. Category: File processing

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.UnsupportedEncodingException;

    import org.apache.html.dom.HTMLDocumentImpl;
    import org.cyberneko.html.parsers.DOMFragmentParser;
    import org.w3c.dom.DocumentFragment;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;

    /**
     * Extracts the plain text from an HTML string.
     *
     * @param content the HTML source
     * @return the concatenated text of all text nodes
     * @throws UnsupportedEncodingException if Windows-1252 or GBK is not available
     */
    public String extractTextFromHTML(String content)
            throws UnsupportedEncodingException {
        DOMFragmentParser parser = new DOMFragmentParser();
        DocumentFragment node = new HTMLDocumentImpl().createDocumentFragment();
        // getBytes() uses the platform default charset (GBK is assumed here).
        InputStream is = new ByteArrayInputStream(content.getBytes());
        try {
            parser.parse(new InputSource(is), node);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (SAXException se) {
            se.printStackTrace();
        }

        // StringBuffer rather than StringBuilder, to stay compatible with JDK 1.4.
        StringBuffer newContent = new StringBuffer();
        this.getText(newContent, node);

        // No encoding was set on the InputSource, so the parser appears to have
        // decoded the bytes as Windows-1252 (NekoHTML's documented default).
        // Turn the text back into bytes and decode them as GBK. This only works
        // when the bytes produced by getBytes() above really were GBK.
        String str = new String(
                newContent.toString().getBytes("Windows-1252"), "GBK");
        return str;
    }

    /** Recursively appends the value of every text node under node to sb. */
    private void getText(StringBuffer sb, Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
        }
        NodeList children = node.getChildNodes();
        if (children != null) {
            int len = children.getLength();
            for (int i = 0; i < len; i++) {
                getText(sb, children.item(i));
            }
        }
    }
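
The recursive walk over text nodes does not depend on NekoHTML itself; it works on any org.w3c.dom tree. As a rough sketch that runs without extra jars, the version below swaps in the JDK's built-in XML parser in place of NekoHTML, so it only accepts well-formed markup (real-world HTML still needs a tolerant parser such as NekoHTML). The names TextWalkSketch, collectText and extractText are made up for this illustration, and StringBuilder assumes JDK 5 or later. Reading from a StringReader hands the parser characters rather than bytes, so no charset conversion is involved in this variant.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class TextWalkSketch {

    // Append the value of every text node under `node`, in document order.
    static void collectText(StringBuilder sb, Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(sb, child);
        }
    }

    // Parse well-formed markup with the JDK parser and return its text content.
    static String extractText(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // A character stream is passed in, so the parser never has to guess an encoding.
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            StringBuilder sb = new StringBuilder();
            collectText(sb, doc);
            return sb.toString();
        } catch (Exception e) {
            // Malformed input ends up here; a tolerant HTML parser would not fail on it.
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractText("<p>Hello, <b>world</b>!</p>"));
    }
}
```

Passing a Reader to InputSource might also be worth trying with the NekoHTML parser above, since it would avoid the byte round trip; that has not been verified here.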

1. nekohtml 1.9.6.1 uses JDK 5 methods such as Arrays.hashCode, so nekohtml 1.9.6 is used instead to stay compatible with JDK 1.4.
2. xerces.jar is required.
3. Related links:
   http://hi.baidu.com/walkandsing/blog/item/f5743634c6ba2e3a5bb5f5e5.html
   http://blog.csdn.net/zhou2002/archive/2008/01/19/2053911.aspx
   http://playfish.javaeye.com/blog/150184

4. Parsing HTML in Python:
   http://lenciel.cn/docs/python-parser-of-xml/
   http://hi.baidu.com/javalang/blog/item/84bac4bf731fb80f18d81fe1.html
   For Ruby, use hpricot.
