
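As XML has developed rapidly, it has become increasingly common in data exchange, configuration files, and as a carrier for structured data. Although XML supports more and more languages, some characters are still not allowed. I ran into this at work: data was entered through a .csv file, the program converted it into an intermediate XML file, and an exception was thrown when that XML file was read back in for processing. Analysis showed that the .csv file contained illegal characters that XML does not support.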

The XML specification (http://www.w3.org/TR/2004/REC-xml-20040204) defines the range of characters that XML supports:

Character Range

[2]   Char   ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]   /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
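To make the production concrete, here is a minimal sketch that transcribes it term by term into a Java predicate and prints the result for a few sample code points. The class name `XmlCharProduction`, the method name `isChar`, and the sample values are illustrative assumptions, not part of the original code.

```java
final class XmlCharProduction {
    // Direct transcription of production [2] Char (hypothetical helper name).
    static boolean isChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        // Sample code points chosen for illustration only.
        int[] samples = {0x0, 0x9, 0xB, 0x20, 0xD800, 0xFFFD, 0xFFFE, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %b%n", cp, isChar(cp));
        }
    }
}
```

Of these samples, U+0000, U+000B, U+D800 and U+FFFE fall outside the production. The others are allowed.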

The following Java code counts the characters in a string that XML does not support:

    /**
     * Returns the number of characters in text that fall outside the
     * XML Char production; 0 means the text is safe to write as XML.
     */
    public static int checkCharacterData(String text) {
        int errorChar = 0;
        if (text == null) {
            return errorChar;
        }

        // do check
        char[] data = text.toCharArray();
        for (int i = 0, len = data.length; i < len; i++) {
            char c = data[i];
            int result = c;

            // high surrogate
            if (result >= 0xD800 && result <= 0xDBFF) {
                // Decode the surrogate pair, but only if a low surrogate follows.
                if (i + 1 < len && data[i + 1] >= 0xDC00 && data[i + 1] <= 0xDFFF) {
                    int high = c;
                    int low = data[i + 1];
                    // Algorithm defined in Unicode spec
                    result = (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
                    i++; // the low surrogate has been consumed
                }
                // Otherwise result stays an unpaired surrogate, which
                // isXMLCharacter rejects below.
            }

            if (!isXMLCharacter(result)) {
                // Usually a control character that cannot be displayed easily.
                errorChar++;
            }
        }

        return errorChar;
    }

    private static boolean isXMLCharacter(int c) {
        if (c <= 0xD7FF) {
            // Below 0x20 only line feed, carriage return and tab are allowed.
            return c >= 0x20 || c == '\n' || c == '\r' || c == '\t';
        }
        if (c < 0xE000) return false;   // surrogate blocks
        if (c <= 0xFFFD) return true;
        if (c < 0x10000) return false;  // 0xFFFE and 0xFFFF
        return c <= 0x10FFFF;
    }
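The method above only counts the offending characters. If the goal is to filter them out before writing the XML file, one possible approach is sketched below. It relies on the standard `String.codePointAt`, `Character.charCount` and `StringBuilder.appendCodePoint` methods to handle surrogate pairs instead of decoding them by hand. The class name `XmlCharFilter` and the method names `stripInvalid` and `isXmlChar` are assumptions made for this illustration. Dropping the characters (rather than replacing them or rejecting the input) is also only one possible policy.

```java
final class XmlCharFilter {
    // Same ranges as the Char production in the XML specification.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of text without the code points XML does not allow.
    static String stripInvalid(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // codePointAt combines a valid surrogate pair into one code point;
            // an unpaired surrogate is returned unchanged and then rejected.
            int cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "a\u0000b\u0007c"; // NUL and BEL are not allowed in XML 1.0
        System.out.println(stripInvalid(dirty)); // prints "abc"
    }
}
```

Each field read from the .csv file could be passed through `stripInvalid` before it is placed in the intermediate XML document.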
Posted on 2005-11-03 13:39 by Jered. Category: Java

Comments:
# re: Filtering characters not supported by XML 2006-11-22 14:43 | 呵呵 [anonymous]
This doesn't work~~

    char c = data[i];
    int result = c;
    // high surrogate
    if (result >= 0xD800 && result <= 0xDBFF) {

Is there a problem here?
result sixteen