javajohn

          金色年華 (Golden Years)

          Chinese Characters or Unicode

          Converting Between Chinese Characters and Unicode Escapes

          (July 17, 2006, 11:07:58)

          1. Overview:

          If a project uses GBK encoding, converting Chinese characters is not a problem. If it uses UTF-8, however, handling Chinese characters takes somewhat more work.
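To make the difference concrete, here is a minimal sketch that prints the bytes one Chinese character produces under UTF-8 and, where the JDK provides it, under GBK. The class name `CharsetBytesDemo` and the helper `hex` are made up for illustration, and GBK availability is an assumption that the code checks at run time. The source is kept ASCII-only, so the character is written as a Java Unicode escape.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetBytesDemo {
    // Renders bytes as space-separated two-digit lowercase hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u6211";     // U+6211, a common Chinese character
        String escaped = "\\u6211"; // the same character spelled as six ASCII characters

        System.out.println("UTF-8 bytes: " + hex(text.getBytes(StandardCharsets.UTF_8)));

        // GBK is an optional charset, so check before using it.
        if (Charset.isSupported("GBK")) {
            Charset gbk = Charset.forName("GBK");
            System.out.println("GBK bytes:   " + hex(text.getBytes(gbk)));
            // The escaped form is pure ASCII, so both charsets encode it identically.
            System.out.println("escaped form identical: " + Arrays.equals(
                    escaped.getBytes(gbk), escaped.getBytes(StandardCharsets.UTF_8)));
        }
    }
}
```

The raw character encodes to different byte sequences depending on the charset, while its escaped spelling stays the same, which is presumably why the post converts text to escapes before writing it.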

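          2. Implementation:

          The code below escapes every character of a string as \uXXXX, writes the escaped text to a stream with a length prefix, and decodes such escapes back into characters:
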

          import java.io.DataOutputStream;
          import java.io.IOException;
          import java.nio.charset.StandardCharsets;

          // Convert to Unicode escapes and write them to the stream
          public static void writeUnicode(final DataOutputStream out,
                  final String value) throws IOException {
              final String unicode = gbEncoding(value);
              // The escaped text is pure ASCII, so name the charset explicitly
              // instead of relying on the platform default.
              final byte[] data = unicode.getBytes(StandardCharsets.US_ASCII);
              final int dataLength = data.length;

              System.out.println("Data Length is: " + dataLength);
              System.out.println("Data is: " + value);
              out.writeInt(dataLength); // first write the length of the string
              out.write(data, 0, dataLength); // then write the converted string
          }

          public static String gbEncoding(final String gbString) {
              final StringBuilder unicodeBytes = new StringBuilder(gbString.length() * 6);
              for (final char c : gbString.toCharArray()) {
                  // Always emit exactly four hex digits: decodeUnicode reads
                  // fixed six-character escapes, so every escape must be padded
                  // to the same width.
                  unicodeBytes.append(String.format("\\u%04x", (int) c));
              }
              return unicodeBytes.toString();
          }

          /**
           * Decodes an escaped string into text that can be displayed in the UI.
           * Purpose: convert Unicode escapes back into the characters they represent.
           *
           * @author javajohn
           * @param dataStr text that may contain backslash-u escapes; may be null
           * @return the decoded text (empty if dataStr is null)
           */
          public static StringBuffer decodeUnicode(final String dataStr) {
              final StringBuffer buffer = new StringBuffer();
              if (dataStr == null) {
                  return buffer;
              }
              int pos = 0;
              while (pos < dataStr.length()) {
                  final int escape = dataStr.indexOf("\\u", pos);
                  if (escape == -1 || escape + 6 > dataStr.length()) {
                      // No further complete escape: copy the remainder verbatim.
                      buffer.append(dataStr, pos, dataStr.length());
                      break;
                  }
                  // Copy the plain text that precedes the escape.
                  buffer.append(dataStr, pos, escape);
                  // Parse the four hex digits that follow the two-character prefix.
                  final String hex = dataStr.substring(escape + 2, escape + 6);
                  buffer.append((char) Integer.parseInt(hex, 16));
                  pos = escape + 6;
              }
              return buffer;
          }
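The listing shows only the writing side. As a rough sketch of the full round trip, the example below escapes a string, writes it with a length prefix, reads it back, and decodes it. It is not the author's code: the class `EscapeRoundTrip` and its methods `escape`, `unescape`, `write`, and `read` are hypothetical names, and the decoder uses `java.util.regex` rather than the index-based scanning above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeRoundTrip {
    // A backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Escapes every UTF-16 code unit as a backslash, 'u', and four hex digits.
    public static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 6);
        for (char c : text.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    // Replaces each escape with its character; any other text passes through.
    public static String unescape(String escaped) {
        Matcher m = ESCAPE.matcher(escaped);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and backslash from being treated specially.
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Writes a 4-byte length prefix followed by the ASCII escape text.
    public static void write(DataOutputStream out, String text) throws IOException {
        byte[] data = escape(text).getBytes(StandardCharsets.US_ASCII);
        out.writeInt(data.length);
        out.write(data);
    }

    // Reads one length-prefixed record and decodes it.
    public static String read(DataInputStream in) throws IOException {
        byte[] data = new byte[in.readInt()];
        in.readFully(data);
        return unescape(new String(data, StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        String original = "\u6211 a"; // U+6211 followed by a space and the letter a
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        write(new DataOutputStream(bytes), original);
        String back = read(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(escape(original));
        System.out.println(back.equals(original));
    }
}
```

Because every character, including the backslash itself, is escaped to a fixed width of six ASCII characters, the reader can decode the record without ambiguity. The cost is a sixfold size increase over one byte per ASCII character.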

          3. Conclusion:

          posted on 2006-07-17 11:07 javajohn  Views (5537)  Comments (1)  Category: My Memories

          Feedback

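          # re: Chinese Characters or Unicode 2006-07-18 17:11 小豬

          On understanding code units and code points:
          1. A code point may consist of one or two code units.
          2. In my test program, each character of "我 " occupies only one code unit, so the number of code points equals the number of code units.
          Below are some notes on Chinese, Korean, and Japanese in Unicode that I found on the official Unicode website: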
          Q: I have heard that UTF-8 does not support some Japanese characters. Is this correct?

          A: There is a lot of misinformation floating around about the support of Chinese, Japanese and Korean (CJK) characters. The Unicode Standard supports all of the CJK characters from JIS X 0208, JIS X 0212, JIS X 0221, or JIS X 0213, for example, and many more. This is true no matter which encoding form of Unicode is used: UTF-8, UTF-16, or UTF-32.

          Unicode supports over 70,000 CJK characters right now, and work is underway to encode further additions. The International Standard ISO/IEC 10646 and the Unicode Standard are completely synchronized in repertoire and content. And that means that Unicode has the same repertoire as GB 18030, since that also is synchronized with ISO 10646 — although with a different ordering and byte format.
          So whichever encoding form is used (UTF-8, UTF-16, or UTF-32), Chinese is fully supported?
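A quick way to check this for a single character is to encode it in each form and decode it again. The sketch below does that; the class name `EncodingFormsDemo` and the helper `roundTrips` are invented for illustration, and UTF-32BE is assumed to be present in the running JDK (the code skips it if it is not).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class EncodingFormsDemo {
    // Encodes text with the given charset, decodes it again, and reports
    // whether the round trip preserved the text.
    public static boolean roundTrips(String text, Charset cs) {
        byte[] encoded = text.getBytes(cs);
        return new String(encoded, cs).equals(text);
    }

    public static void main(String[] args) {
        String text = "\u6211"; // U+6211, a common Chinese character

        List<Charset> forms = new ArrayList<>();
        forms.add(StandardCharsets.UTF_8);
        forms.add(StandardCharsets.UTF_16BE);
        // UTF-32 is not among the charsets every JVM must provide,
        // so only use it when it is available.
        if (Charset.isSupported("UTF-32BE")) {
            forms.add(Charset.forName("UTF-32BE"));
        }

        for (Charset cs : forms) {
            System.out.println(cs.name() + ": " + text.getBytes(cs).length
                    + " bytes, round trip ok = " + roundTrips(text, cs));
        }
    }
}
```

The byte counts differ between the forms, but the decoded character is the same in each case, which matches the FAQ's point that the repertoire does not depend on the encoding form.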


          My test program is as follows:
          public class test0 {
              public static void main(String[] args) {
                  String a = "我 ";
                  // length() counts UTF-16 code units
                  int cuCount = a.length();
                  System.out.println("the number of code units required for string \"test\" in the UTF-16 encoding is " + cuCount);
                  // codePointCount() counts Unicode code points
                  int cpCount = a.codePointCount(0, a.length());
                  System.out.println("the number of code points is " + cpCount);
                  System.out.println("the end of string \"我 \" is " + a.charAt(a.length() - 1));
              }
          }

          The output is:
          the number of code units required for string "test" in the UTF-16 encoding is 2
          the number of code points is 2
          the end of string "我 " is [space]
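Both characters in that test lie in the Basic Multilingual Plane, so the two counts agree. For contrast, here is a small sketch with a supplementary character, where one code point takes two code units. The class name `SupplementaryDemo` and the helper `sample` are hypothetical, and U+20000 is chosen only as an example of a CJK character outside the BMP.

```java
public class SupplementaryDemo {
    // Builds a one-character string from a code point outside the BMP.
    public static String sample() {
        return new String(Character.toChars(0x20000));
    }

    public static void main(String[] args) {
        String s = sample();
        // UTF-16 stores this code point as a surrogate pair.
        System.out.println("code units:  " + s.length());                      // 2
        System.out.println("code points: " + s.codePointCount(0, s.length())); // 1
        System.out.println("first unit is a high surrogate: "
                + Character.isHighSurrogate(s.charAt(0)));                     // true
    }
}
```

A per-`char` escaper such as `gbEncoding` would emit two escapes for this character, one for each surrogate.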

          I found the Set Encoding option in Eclipse, where the encoding can be configured.

