            November 3, 2005

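          As XML has grown rapidly, it shows up more and more in data exchange, configuration files, and as a carrier for formatted data. XML supports an increasing number of languages, but some characters are still not allowed. I ran into this at work: data was entered through a .csv file, the program converted it into an intermediate XML file, and an exception was thrown when that XML file was loaded for processing. Analysis showed that the .csv file contained illegal characters that XML does not support.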

          The XML specification (http://www.w3.org/TR/2004/REC-xml-20040204) defines the range of characters XML supports:

          Character Range

          [2]   Char   ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

          /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

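          As a compact illustration of the Char production, the same ranges can be written as a negated character class in a regular expression. This is only a sketch: the class name XmlCharPattern and the method containsInvalidXmlChar are made up for this example and are not part of the original code. It relies on java.util.regex matching by code point, so a well-formed surrogate pair is treated as a single supplementary character.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not from the original post.
final class XmlCharPattern {

    // Matches any single code point that falls outside the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static final Pattern INVALID_XML_CHAR = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Returns true if the text contains at least one character XML does not allow.
    static boolean containsInvalidXmlChar(String text) {
        return text != null && INVALID_XML_CHAR.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsInvalidXmlChar("plain text\n"));
        System.out.println(containsInvalidXmlChar("bad\u0001char"));
    }
}
```

          A regular expression is convenient for a quick yes/no test; counting or reporting the offending characters needs an explicit loop.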

          Java code that counts the characters in a string that XML does not support:

          public static int checkCharacterData(String text) {
              int errorChar = 0;
              if (text == null) {
                  return errorChar;
              }

              // Walk the UTF-16 code units, combining surrogate pairs into code points
              char[] data = text.toCharArray();
              for (int i = 0, len = data.length; i < len; i++) {
                  char c = data[i];
                  int result = c;

                  // High surrogate: try to decode a surrogate pair
                  if (result >= 0xD800 && result <= 0xDBFF) {
                      int high = c;
                      if (i + 1 < len) {
                          int low = data[i + 1];
                          if (low >= 0xDC00 && low <= 0xDFFF) {
                              // Algorithm defined in the Unicode spec
                              result = (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
                              i++;
                          }
                      }
                      // An unpaired high surrogate keeps its own value and fails the check below
                  }

                  if (!isXMLCharacter(result)) {
                      // Outside the Char production: count it as an illegal character
                      errorChar++;
                  }
              }

              // Number of illegal characters found; 0 means the text is clean
              return errorChar;
          }

          private static boolean isXMLCharacter(int c) {
              if (c <= 0xD7FF) {
                  if (c >= 0x20) return true;
                  // Below 0x20, only line feed, carriage return and tab are allowed
                  return c == '\n' || c == '\r' || c == '\t';
              }

              if (c < 0xE000) return false;   // surrogate block
              if (c <= 0xFFFD) return true;
              if (c < 0x10000) return false;  // FFFE and FFFF
              return c <= 0x10FFFF;
          }
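          Once illegal characters have been detected, one option is to drop them before the XML intermediate file is written. The following is a minimal sketch of that idea, not part of the original code: the class XmlCharFilter and the methods isXmlChar and stripInvalidXmlChars are hypothetical names. It uses String.codePointAt and Character.charCount from the standard library so that surrogate pairs are decoded by the JDK rather than by hand.

```java
// Hypothetical helper, not from the original post.
final class XmlCharFilter {

    // Same ranges as the XML 1.0 Char production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of text with every character outside the Char production removed.
    static String stripInvalidXmlChars(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // codePointAt returns the surrogate itself when it is unpaired,
            // which isXmlChar then rejects.
            int cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripInvalidXmlChars("field\u0001value"));
    }
}
```

          Whether silently dropping characters is acceptable depends on the data; for a .csv import it may be safer to report the count first and reject the file.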
          posted @ 2005-11-03 13:39 Jered