華夢行
Focused on Java


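I recently noticed that when htmlparser parses certain web pages, Traditional Chinese text comes out garbled. Looking into the cause, I found that when StringBean is used, htmlparser decides by itself which charset to decode with, based on the page's meta tag. Some sites declare gb2312 as the charset in their meta tag but actually serve content encoded in GBK. GB2312 cannot represent Traditional Chinese characters, so the text turns into garbage. The fix is simple: GBK is backward compatible with GB2312, so add one check to getCharset() in htmlparser's Page.java. If ret is gb2312, set it to GBK, and the problem goes away.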
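The mismatch described above can be reproduced without htmlparser at all. The following is a minimal sketch, assuming the JDK in use ships both the GBK and GB2312 charsets (the extended charsets are normally present in a full JDK). The class name `Gb2312VersusGbk` and the helper `decode` are made up for this illustration. It encodes a two-character Traditional Chinese string as GBK and then decodes the bytes once as GBK and once as GB2312.

```java
import java.nio.charset.Charset;

public class Gb2312VersusGbk {
    // Decode bytes with the named charset. Byte sequences the charset cannot
    // map are replaced with U+FFFD by the String constructor.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The word for "Traditional Chinese script", written with Unicode
        // escapes so the encoding of this source file does not matter.
        String original = "\u7e41\u9ad4";

        // Simulate a page that is really served as GBK.
        byte[] gbkBytes = original.getBytes(Charset.forName("GBK"));

        // Decode the same bytes both ways.
        String asGbk = decode(gbkBytes, "GBK");
        String asGb2312 = decode(gbkBytes, "GB2312");

        System.out.println("GBK round trip intact:    " + original.equals(asGbk));
        System.out.println("GB2312 round trip intact: " + original.equals(asGb2312));
    }
}
```

The second character has no code point in GB2312, so only the GBK decode should give back the original string. That is the same situation htmlparser ends up in when it trusts a gb2312 meta tag on a page that is really GBK.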

The modified Page.java code is as follows (/lexer/Page.java):

    public String getCharset (String content)
    {
        final String CHARSET_STRING = "charset";
        int index;
        String ret;

        if (null == mSource)
            ret = DEFAULT_CHARSET;
        else
            // use existing (possibly supplied) character set:
            // bug #1322686 when illegal charset specified
            ret = mSource.getEncoding ();
        if (null != content)
        {
            index = content.indexOf (CHARSET_STRING);

            if (index != -1)
            {
                content = content.substring (index +
                    CHARSET_STRING.length ()).trim ();
                if (content.startsWith ("="))
                {
                    content = content.substring (1).trim ();
                    index = content.indexOf (";");
                    if (index != -1)
                        content = content.substring (0, index);

                    //remove any double quotes from around charset string
                    if (content.startsWith ("\"") && content.endsWith ("\"")
                        && (1 < content.length ()))
                        content = content.substring (1, content.length () - 1);

                    //remove any single quote from around charset string
                    if (content.startsWith ("'") && content.endsWith ("'")
                        && (1 < content.length ()))
                        content = content.substring (1, content.length () - 1);

                    ret = findCharset (content, ret);

                    // Charset names are not case-sensitive;
                    // that is, case is always ignored when comparing
                    // charset names.
//                    if (!ret.equalsIgnoreCase (content))
//                    {
//                        System.out.println (
//                            "detected charset \""
//                            + content
//                            + "\", using \""
//                            + ret
//                            + "\"");
//                    }
                }
            }
        }
        if(ret.equalsIgnoreCase("gb2312"))ret="GBK"; //to avoid decode problem, edited by linyunfan
        return (ret);
    }


The line added at the end is:

        if(ret.equalsIgnoreCase("gb2312"))ret="GBK";
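For readers who want to try the idea outside the library, here is a rough standalone sketch of the same logic. It is not htmlparser's code: the class `CharsetNormalizer` and the methods `extractCharset` and `preferGbk` are hypothetical names invented for this example, and the parsing only loosely mirrors what getCharset() does above (it skips the findCharset lookup entirely).

```java
public class CharsetNormalizer {
    // Hypothetical helper: pull the charset value out of a meta content
    // string such as "text/html; charset=gb2312". Returns the fallback
    // when no charset is declared.
    static String extractCharset(String content, String fallback) {
        if (content == null) {
            return fallback;
        }
        int index = content.indexOf("charset");
        if (index == -1) {
            return fallback;
        }
        String rest = content.substring(index + "charset".length()).trim();
        if (!rest.startsWith("=")) {
            return fallback;
        }
        rest = rest.substring(1).trim();
        int semicolon = rest.indexOf(';');
        if (semicolon != -1) {
            rest = rest.substring(0, semicolon).trim();
        }
        // Strip one pair of matching double or single quotes, if present.
        boolean doubleQuoted = rest.startsWith("\"") && rest.endsWith("\"");
        boolean singleQuoted = rest.startsWith("'") && rest.endsWith("'");
        if (rest.length() > 1 && (doubleQuoted || singleQuoted)) {
            rest = rest.substring(1, rest.length() - 1);
        }
        return rest.isEmpty() ? fallback : rest;
    }

    // The one-line fix from the post: treat a declared gb2312 as GBK.
    static String preferGbk(String charset) {
        return "gb2312".equalsIgnoreCase(charset) ? "GBK" : charset;
    }

    public static void main(String[] args) {
        String declared = extractCharset("text/html; charset=gb2312", "ISO-8859-1");
        System.out.println(declared + " -> " + preferGbk(declared));
    }
}
```

Keeping the gb2312 check in its own small method means it can be applied to a charset name from any source, for example before constructing a reader, without touching the library source.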


posted on 2008-10-09 13:33 華夢行

Comments:

# re: Traditional Chinese becomes garbled when htmlparser parses some web pages 2008-12-28 22:01 | 繁體

...

# re: Traditional Chinese becomes garbled when htmlparser parses some web pages 2009-03-11 15:45 | pnut

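Nice. This works perfectly!
Calling parser.setEncoding("gbk") has no effect, because at run time it gets overridden by the "charset" of the fetched page.
GBK is a superset of GB2312, so decoding gb2312 pages as GBK causes no problems at all. In fact, the gb2312 declared on many pages is inaccurate, and IE always ignores it and renders with GBK anyway. As things stand, though, many sites still declare gb2312, so the author's approach is a good idea.
One addition: Page.java is in htmllexer.jar, and the source can be downloaded from "http://sourceforge.net/projects/htmlparser/".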

# re: Traditional Chinese becomes garbled when htmlparser parses some web pages [not logged in] 2011-08-07 15:53 | 小武

I changed it the way the author described, but it still doesn't work?
QQ: 1161008015. Feel free to add me on QQ to discuss.
