
          This article is excerpted from: http://blog.hexun.com/nantiange/1776819/rss/viewarticle.html

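          1. Definition and history of codepages

          A character code is the internal code used to represent a character. Both entering and storing a document rely on character codes, which fall into two classes:

          • Single-byte character sets (SBCS), which can encode up to 256 characters.
          • Double-byte character sets (DBCS), which can encode roughly 65,000 characters and are used mainly for the large character sets of East Asian languages.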

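          A codepage is a curated list of character codes arranged in a specific order. For the early single-byte languages, the order of the codes in the codepage lets the system use the list to produce the matching character code for each keyboard input value. For double-byte encodings, the codepage instead provides a mapping table between the multibyte encoding and Unicode, so that characters stored as Unicode can be converted to the corresponding character code and back. In the Linux kernel, the corresponding functions are utf8_mbtowc and utf8_wctomb.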
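The multibyte-to-wide-character direction can be illustrated in user space. The sketch below is not the kernel's utf8_mbtowc; the name utf8_to_ucs2 and its signature are hypothetical, and it assumes only 16-bit code units are needed (1 to 3 byte UTF-8 sequences).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence (1 to 3 bytes) from s[0..len) into a 16-bit
 * code unit. Returns the number of bytes consumed, or -1 on malformed,
 * truncated, or overlong input. Hypothetical helper for illustration. */
int utf8_to_ucs2(uint16_t *out, const uint8_t *s, size_t len)
{
    unsigned wc;

    assert(out != NULL && s != NULL);
    if (len == 0)
        return -1;
    if (s[0] < 0x80) {                        /* 0xxxxxxx: plain ASCII */
        *out = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {              /* 110xxxxx 10xxxxxx */
        if (len < 2 || (s[1] & 0xC0) != 0x80)
            return -1;
        wc = ((unsigned)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        if (wc < 0x80)                        /* overlong form */
            return -1;
        *out = (uint16_t)wc;
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {              /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (len < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
            return -1;
        wc = ((unsigned)(s[0] & 0x0F) << 12) |
             ((unsigned)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        if (wc < 0x800)                       /* overlong form */
            return -1;
        *out = (uint16_t)wc;
        return 3;
    }
    return -1;                                /* longer forms do not fit in 16 bits */
}
```

For instance, the three bytes E4 B8 AD decode to the code unit 0x4E2D with a return value of 3.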

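          Before 1980 there was still no international standard such as ISO-8859 or Unicode defining how to extend US-ASCII for users in non-English-speaking countries. Many IT vendors invented their own encodings and identified them with hard-to-remember numbers.

          For example, 936 denotes Simplified Chinese and 950 denotes Traditional Chinese.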

          1.1 CJK Codepage

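          Unlike the Extended Unix Code (EUC) encodings, all of the Far East codepages below use the C1 control codes { =80..=9F } as lead bytes and the ASCII values { =40..=7E } as trail bytes, which is what lets them hold tens of thousands of double-byte characters. As a consequence, in these encodings a byte value greater than 3F does not necessarily represent an ASCII character.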
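Because trail bytes overlap the ASCII range, a byte such as 0x5C (backslash) can only be interpreted by scanning from the start of the string. The following is a minimal sketch of that scan. The names byte_role and dbcs_byte_role are hypothetical, and the lead-byte range 0x81..0xFE is an assumption taken from the "81 - FE" range used for cp936 in uni2gbk.pl later in this article; other codepages use different ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum byte_role { ROLE_SINGLE, ROLE_LEAD, ROLE_TRAIL };

/* Assumed lead-byte range for cp936 (GBK). */
static int is_lead_cp936(uint8_t b)
{
    return b >= 0x81 && b <= 0xFE;
}

/* Walk the buffer from the beginning and report how the byte at index pos
 * is used: as a single-byte character, a lead byte, or a trail byte.
 * A lead byte with nothing after it is treated as a single byte. */
enum byte_role dbcs_byte_role(const uint8_t *buf, size_t len, size_t pos)
{
    size_t i = 0;

    assert(buf != NULL && pos < len);
    while (i < len) {
        if (is_lead_cp936(buf[i]) && i + 1 < len) {
            if (pos == i)
                return ROLE_LEAD;
            if (pos == i + 1)
                return ROLE_TRAIL;
            i += 2;                 /* skip the whole double-byte character */
        } else {
            if (pos == i)
                return ROLE_SINGLE;
            i += 1;
        }
    }
    return ROLE_SINGLE;             /* not reached while pos < len */
}
```

In the byte sequence 41 81 5C 5C, the first 5C is the trail byte of the double-byte character 81 5C, while the second 5C is a real backslash.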

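          CP932

          Shift-JIS combines the Japanese charsets JIS X 0201 (one byte per character) and JIS X 0208 (two bytes per character). The JIS X 0201 katakana are single-byte half-width characters, and the remaining 60 byte values serve as lead bytes for 7076 kanji and 648 other full-width characters. Unlike EUC-JP, Shift-JIS does not include the 5802 kanji defined in JIS X 0212.

          CP936

          GBK extends the EUC-CN encoding (GB 2312-80, containing 6763 Chinese characters) to the 20902 Chinese characters defined in Unicode (GB 13000.1-93). It is used for Simplified Chinese (zh_CN) in mainland China.

          CP949

          Unified Hangul Code (UHC) is a superset of the Korean EUC-KR encoding (KS C 5601-1992, containing 2350 Hangul syllables and 4888 Hanja), adding 8822 further Hangul syllables (in the C1 range).

          CP950

          Big5 (13072 Traditional Chinese zh_TW characters), which is used in place of EUC-TW (CNS 11643-1992). All of these definitions can be found in Ken Lunde's CJK.INF or in the Unicode mapping tables.

          Note: Microsoft uses the four codepages above, so one of them must be used when accessing Microsoft file systems.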

          1.2 IBM's Far East codepages

          IBM's codepages come in two kinds, SBCS and DBCS:

          IBM SBCS Codepage

          • 37 (English) *
          • 290 (Japanese) *
          • 833 (Korean) *
          • 836 (Simplified Chinese) *
          • 891 (Korean)
          • 897 (Japanese)
          • 903 (Simplified Chinese)
          • 904 (Traditional Chinese)

          IBM DBCS Codepage

          • 300 (Japanese) *
          • 301 (Japanese)
          • 834 (Korean) *
          • 835 (Traditional Chinese) *
          • 837 (Simplified Chinese) *
          • 926 (Korean)
          • 927 (Traditional Chinese)
          • 928 (Simplified Chinese)

          Combining an SBCS codepage with a DBCS codepage gives an IBM MBCS Codepage:

          • 930 (Japanese) (Codepage 300 plus 290) *
          • 932 (Japanese) (Codepage 301 plus 897)
          • 933 (Korean) (Codepage 834 plus 833) *
          • 934 (Korean) (Codepage 926 plus 891)
          • 938 (Traditional Chinese) (Codepage 927 plus 904)
          • 936 (Simplified Chinese) (Codepage 928 plus 903)
          • 5031 (Simplified Chinese) (Codepage 837 plus 836) *
          • 5033 (Traditional Chinese) (Codepage 835 plus 37) *

          * denotes a codepage that uses the EBCDIC encoding format.

          This shows that Microsoft's CJK codepages derive from IBM's codepages.

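          2. The role of codepages under Linux

          Codepage support was introduced in Linux mainly to handle multilingual file names on file systems such as FAT/VFAT/FAT32/NTFS/NCPFS. NTFS and FAT32/VFAT store file names in Unicode, so the system has to convert them dynamically to the appropriate language encoding when reading them. NLS support was introduced for this purpose. The corresponding source files are in /usr/src/linux/fs/nls: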

          • Config.in
          • Makefile
          • nls_base.c
          • nls_cp437.c
          • nls_cp737.c
          • nls_cp775.c
          • nls_cp850.c
          • nls_cp852.c
          • nls_cp855.c
          • nls_cp857.c
          • nls_cp860.c
          • nls_cp861.c
          • nls_cp862.c
          • nls_cp863.c
          • nls_cp864.c
          • nls_cp865.c
          • nls_cp866.c
          • nls_cp869.c
          • nls_cp874.c
          • nls_cp936.c
          • nls_cp950.c
          • nls_iso8859-1.c
          • nls_iso8859-15.c
          • nls_iso8859-2.c
          • nls_iso8859-3.c
          • nls_iso8859-4.c
          • nls_iso8859-5.c
          • nls_iso8859-6.c
          • nls_iso8859-7.c
          • nls_iso8859-8.c
          • nls_iso8859-9.c
          • nls_koi8-r.c

          These implement the following functions:

          • extern int utf8_mbtowc(__u16 *, const __u8 *, int);
          • extern int utf8_mbstowcs(__u16 *, const __u8 *, int);
          • extern int utf8_wctomb(__u8 *, __u16, int);
          • extern int utf8_wcstombs(__u8 *, const __u16 *, int);
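To make the wide-character-to-multibyte direction concrete, here is a minimal user-space sketch of encoding one 16-bit code unit as UTF-8. It is not the kernel implementation of utf8_wctomb; the name ucs2_to_utf8, its parameter order, and its error convention are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one 16-bit code unit as UTF-8 into out[0..cap).
 * Returns the number of bytes written (1 to 3), or -1 if the buffer is too
 * small. Surrogate values are not treated specially in this sketch. */
int ucs2_to_utf8(uint8_t *out, size_t cap, uint16_t wc)
{
    assert(out != NULL);
    if (wc < 0x80) {                          /* 7 bits: one byte */
        if (cap < 1)
            return -1;
        out[0] = (uint8_t)wc;
        return 1;
    }
    if (wc < 0x800) {                         /* 11 bits: two bytes */
        if (cap < 2)
            return -1;
        out[0] = (uint8_t)(0xC0 | (wc >> 6));
        out[1] = (uint8_t)(0x80 | (wc & 0x3F));
        return 2;
    }
    if (cap < 3)                              /* 16 bits: three bytes */
        return -1;
    out[0] = (uint8_t)(0xE0 | (wc >> 12));
    out[1] = (uint8_t)(0x80 | ((wc >> 6) & 0x3F));
    out[2] = (uint8_t)(0x80 | (wc & 0x3F));
    return 3;
}
```

Encoding 0x4E2D this way produces the bytes E4 B8 AD.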

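          When mounting such a file system, the codepage can then be set with the following options.

          For codepage 437:

          mount -t vfat /dev/hda1 /mnt/1 -o codepage=437,iocharset=cp437

          Long file names in different languages can then be accessed normally under Linux.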
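A program that mounts the file system itself would pass the same option string as the data argument of mount(2). The sketch below only builds that string; the helper name vfat_mount_opts is hypothetical, and the option names are the ones used in the mount command above.

```c
#include <assert.h>
#include <stdio.h>

/* Format "codepage=<n>,iocharset=<name>" into buf.
 * Returns the string length, or -1 if it does not fit in cap bytes.
 * Hypothetical helper; the result is what would follow -o on the
 * mount command line. */
int vfat_mount_opts(char *buf, size_t cap, int codepage, const char *iocharset)
{
    int n;

    assert(buf != NULL && iocharset != NULL);
    n = snprintf(buf, cap, "codepage=%d,iocharset=%s", codepage, iocharset);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    return n;
}
```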

          3. Codepages supported under Linux

          • nls codepage 437 -- US/Canadian English
          • nls codepage 737 -- Greek
          • nls codepage 775 -- Baltic languages
          • nls codepage 850 -- includes some characters of Western European languages (German, Spanish, Italian)
          • nls codepage 852 -- Latin 2, covering Central and Eastern European languages (Albanian, Croatian, Czech, English, Finnish, Hungarian, Irish, German, Polish, Romanian, Serbian, Slovak, Slovenian, Sorbian)
          • nls codepage 855 -- Cyrillic
          • nls codepage 857 -- Turkish
          • nls codepage 860 -- Portuguese
          • nls codepage 861 -- Icelandic
          • nls codepage 862 -- Hebrew
          • nls codepage 863 -- Canadian French
          • nls codepage 864 -- Arabic
          • nls codepage 865 -- Nordic languages
          • nls codepage 866 -- Cyrillic/Russian
          • nls codepage 869 -- Greek (2)
          • nls codepage 874 -- Thai
          • nls codepage 936 -- Simplified Chinese GBK
          • nls codepage 950 -- Traditional Chinese Big5

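          • nls iso8859-1 -- Western European languages (Albanian, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, German, Galician, Irish, Icelandic, Italian, Norwegian, Portuguese, Spanish, Swedish). It is also suitable for US English.
          • nls iso8859-2 -- Latin 2 character set, Slavic/Central European languages (Czech, German, Hungarian, Polish, Romanian, Croatian, Slovak, Slovenian)
          • nls iso8859-3 -- Latin 3 character set (Esperanto, Galician, Maltese, Turkish)
          • nls iso8859-4 -- Latin 4 character set (Estonian, Latvian, Lithuanian); the predecessor of the Latin 6 character set
          • nls iso8859-5 -- Cyrillic (Bulgarian, Byelorussian, Macedonian, Russian, Serbian, Ukrainian); the KOI8-R codepage is generally recommended instead
          • nls iso8859-6 -- Arabic
          • nls iso8859-7 -- Modern Greek
          • nls iso8859-8 -- Hebrew
          • nls iso8859-9 -- Latin 5 character set (replaces some rarely used Icelandic characters of Latin 1 with Turkish ones)
          • nls iso8859-10 -- Latin 6 character set (Inuit (Greenlandic), Sami and other Nordic languages not covered by Latin 4)
          • nls iso8859-15 -- Latin 9 character set, an update of Latin 1 that drops some rarely used characters, adds support for Estonian, corrects the French and Finnish coverage, and adds the Euro sign
          • nls koi8-r -- the default support for Russian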

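          4. Codepages for Simplified Chinese GBK / Traditional Chinese Big5

          How are the Simplified Chinese GBK and Traditional Chinese Big5 codepages built?

          1. Obtain the Unicode mapping definitions for GBK/Big5 from the Unicode organization.

            GBK is based on ISO 10646-1:1993; the corresponding Japanese standard is JIS X 0221-1995 and the Korean one is KS C 5700-1995. They line up with the Unicode versions as follows:
            Unicode Version 1.0
            Unicode Version 1.1 <-> ISO 10646-1:1993, JIS X 0221-1995, GB 13000.1-93
            Unicode Version 2.0 <-> KS C 5700-1995

            GBK has been used since Windows 95. The files you need are CP936.TXT and BIG5.TXT.

          2. Then use the following commands to convert them into the Unicode<->GBK (or Big5) code tables that the Linux kernel needs:
            ./genmap BIG5.txt | perl uni2big5.pl
            ./genmap CP936.txt | perl uni2gbk.pl
          3. Finally, modify the relevant fat/vfat/ntfs functions to complete the kernel changes. In use, the file system can be mounted with the following commands:

          • Simplified Chinese: mount -t vfat /dev/hda1 /mnt/1 -o codepage=936,iocharset=cp936
          • Traditional Chinese: mount -t vfat /dev/hda1 /mnt/1 -o codepage=950,iocharset=cp936

          Interestingly, because GBK contains all of the characters of GB2312/Big5/JIS, Big5 file names can also be displayed using codepage 936.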

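          5. Appendix

          5.1 Authors and related documents

          Codepage 950 support was written by cosmos of Taiwan; home page: http://www.cis.nctu.edu.tw:8080/~is84086/Project/kernel_cp950/

          The GBK cp936 support was written by Fang Han (方漢) and Chen Xiangyang (陳向陽) of the TurboLinux Chinese R&D team.

          5.2 genmap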

          #!/bin/sh
          cat $1 | awk '{if(index($1,"#")==0)print $0}' | awk 'BEGIN{FS="0x"}{print $2 $3}' | awk '{if(length($1)==length($2))print $1,$2}'
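The filtering done by the genmap pipeline can also be sketched in C. The sketch assumes the mapping files hold lines of the form `0xXXXX<TAB>0xYYYY<TAB>#comment`, with the charset code first and the Unicode value second, as the awk field handling above implies. The function name parse_map_line is hypothetical, and the sample lines in the usage note are illustrative input rather than quotations from CP936.TXT.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse one mapping-file line. Returns 1 and fills both outputs when the
 * line holds two hex codes with the same number of digits (genmap keeps
 * only those, which drops single-byte entries such as 0x41 -> 0x0041).
 * Returns 0 for comment lines and for anything else. */
int parse_map_line(const char *line, unsigned *charset_code, unsigned *unicode)
{
    char a[16], b[16];

    assert(line != NULL && charset_code != NULL && unicode != NULL);
    if (line[0] == '#')
        return 0;
    if (sscanf(line, " 0x%15[0-9A-Fa-f] 0x%15[0-9A-Fa-f]", a, b) != 2)
        return 0;
    if (strlen(a) != strlen(b))
        return 0;
    *charset_code = (unsigned)strtoul(a, NULL, 16);
    *unicode = (unsigned)strtoul(b, NULL, 16);
    return 1;
}
```

Given a line such as `0x8140<TAB>0x4E02<TAB>#CJK UNIFIED IDEOGRAPH`, the function reports the pair (0x8140, 0x4E02), while a single-byte line such as `0x41<TAB>0x0041` is rejected because the two codes differ in length.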

          5.3 uni2big5.pl

          #!/usr/bin/perl

          @code = (
          "00", "01", "02", "03", "04", "05", "06", "07",
          "08", "09", "0A", "0B", "0C", "0D", "0E", "0F",
          "10", "11", "12", "13", "14", "15", "16", "17",
          "18", "19", "1A", "1B", "1C", "1D", "1E", "1F",
          "20", "21", "22", "23", "24", "25", "26", "27",
          "28", "29", "2A", "2B", "2C", "2D", "2E", "2F",
          "30", "31", "32", "33", "34", "35", "36", "37",
          "38", "39", "3A", "3B", "3C", "3D", "3E", "3F",
          "40", "41", "42", "43", "44", "45", "46", "47",
          "48", "49", "4A", "4B", "4C", "4D", "4E", "4F",
          "50", "51", "52", "53", "54", "55", "56", "57",
          "58", "59", "5A", "5B", "5C", "5D", "5E", "5F",
          "60", "61", "62", "63", "64", "65", "66", "67",
          "68", "69", "6A", "6B", "6C", "6D", "6E", "6F",
          "70", "71", "72", "73", "74", "75", "76", "77",
          "78", "79", "7A", "7B", "7C", "7D", "7E", "7F",
          "80", "81", "82", "83", "84", "85", "86", "87",
          "88", "89", "8A", "8B", "8C", "8D", "8E", "8F",
          "90", "91", "92", "93", "94", "95", "96", "97",
          "98", "99", "9A", "9B", "9C", "9D", "9E", "9F",
          "A0", "A1", "A2", "A3", "A4", "A5", "A6", "A7",
          "A8", "A9", "AA", "AB", "AC", "AD", "AE", "AF",
          "B0", "B1", "B2", "B3", "B4", "B5", "B6", "B7",
          "B8", "B9", "BA", "BB", "BC", "BD", "BE", "BF",
          "C0", "C1", "C2", "C3", "C4", "C5", "C6", "C7",
          "C8", "C9", "CA", "CB", "CC", "CD", "CE", "CF",
          "D0", "D1", "D2", "D3", "D4", "D5", "D6", "D7",
          "D8", "D9", "DA", "DB", "DC", "DD", "DE", "DF",
          "E0", "E1", "E2", "E3", "E4", "E5", "E6", "E7",
          "E8", "E9", "EA", "EB", "EC", "ED", "EE", "EF",
          "F0", "F1", "F2", "F3", "F4", "F5", "F6", "F7",
          "F8", "F9", "FA", "FB", "FC", "FD", "FE", "FF");

          while (<STDIN>){
          ($unicode, $big5) = split;
          ($high, $low) = $unicode =~ /(..)(..)/;
          $table2{$high}{$low} = $big5;
          ($high, $low) = $big5 =~ /(..)(..)/;
          $table{$high}{$low} = $unicode;
          }

          print <<EOF;
          /*
          * linux/fs/nls_cp950.c
          *
          * Charset cp950 translation tables.
          * Generated automatically from the Unicode and charset
          * tables from the Unicode Organization (www.unicode.org).
          * The Unicode to charset table has only exact mappings.
          */

          #include <linux/module.h>
          #include <linux/kernel.h>
          #include <linux/string.h>
          #include <linux/nls.h>

          /* A1 - F9*/
          static struct nls_unicode charset2uni[(0xF9-0xA1+1)*(0x100-0x60)] = {
          EOF

          for ($high=0xA1; $high <= 0xF9; $high++){
          for ($low=0x40; $low <= 0x7F; $low++){
          $unicode = $table2{$code[$high]}{$code[$low]};
          $unicode = "0000" if (!(defined $unicode));
          print "\n\t" if ($low%4 == 0);
          print "/* $code[$high]$code[$low]*/\n\t" if ($low%0x10 == 0);
          ($uhigh, $ulow) = $unicode =~ /(..)(..)/;
          printf("{0x%2s, 0x%2s}, ", $ulow, $uhigh);
          }
          for ($low=0xA0; $low <= 0xFF; $low++){
          $unicode = $table2{$code[$high]}{$code[$low]};
          $unicode = "0000" if (!(defined $unicode));
          print "\n\t" if ($low%4 == 0);
          print "/* $code[$high]$code[$low]*/\n\t" if ($low%0x10 == 0);
          ($uhigh, $ulow) = $unicode =~ /(..)(..)/;
          printf("{0x%2s, 0x%2s}, ", $ulow, $uhigh);
          }
          }

          print "\n};\n\n";
          for ($high=1; $high <= 255;$high++){
          if (defined $table{$code[$high]}){
          print "static unsigned char page$code[$high]\[512\] = {\n\t";
          for ($low=0; $low<=255;$low++){
          $big5 = $table{$code[$high]}{$code[$low]};
          $big5 = "3F3F" if (!(defined $big5));
          if ($low > 0 && $low%4 == 0){
          printf("/* 0x%02X-0x%02X */\n\t", $low-4, $low-1);
          }
          print "\n\t" if ($low == 0x80);
          ($bhigh, $blow) = $big5 =~ /(..)(..)/;
          printf("0x%2s, 0x%2s, ", $bhigh, $blow);
          }
          print "/* 0xFC-0xFF */\n};\n\n";
          }
          }

          print "static unsigned char *page_uni2charset[256] = {";
          for ($high=0; $high<=255;$high++){
          print "\n\t" if ($high%8 == 0);
          if ($high>0 && defined $table{$code[$high]}){
          print "page$code[$high], ";
          }
          else{
          print "NULL, ";
          }
          }
          print <<EOF;

          };

          static unsigned char charset2upper[256] = {
          0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* 0x00-0x07 */
          0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, /* 0x08-0x0f */
          0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, /* 0x10-0x17 */
          0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f, /* 0x18-0x1f */
          0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, /* 0x20-0x27 */
          0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f, /* 0x28-0x2f */
          0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, /* 0x30-0x37 */
          0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f, /* 0x38-0x3f */
          0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, /* 0x40-0x47 */
          0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, /* 0x48-0x4f */
          0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, /* 0x50-0x57 */
          0x58, 0x59, 0x5a, 0x5b, 0x5c, 0x5d, 0x5e, 0x5f, /* 0x58-0x5f */
          0x60, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x60-0x67 */
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x68-0x6f */
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x70-0x77 */
          0x00, 0x00, 0x00, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f, /* 0x78-0x7f */
          0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, /* 0x80-0x87 */
          0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f, /* 0x88-0x8f */
          0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, /* 0x90-0x97 */
          0x98, 0x99, 0x9a, 0x00, 0x9c, 0x00, 0x00, 0x00, /* 0x98-0x9f */
          0x00, 0x00, 0x00, 0x00, 0xa4, 0xa5, 0xa6, 0xa7, /* 0xa0-0xa7 */
          0xa8, 0xa9, 0xaa, 0xab, 0xac, 0xad, 0xae, 0xaf, /* 0xa8-0xaf */
          0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0xb5, 0xb6, 0xb7, /* 0xb0-0xb7 */
          0xb8, 0xb9, 0xba, 0xbb, 0xbc, 0xbd, 0xbe, 0xbf, /* 0xb8-0xbf */
          0xc0, 0xc1, 0xc2, 0xc3, 0xc4, 0xc5, 0xc6, 0xc7, /* 0xc0-0xc7 */
          0xc8, 0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf, /* 0xc8-0xcf */
          0xd0, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0x00, 0x00, /* 0xd0-0xd7 */
          0x00, 0xd9, 0xda, 0xdb, 0xdc, 0x00, 0x00, 0xdf, /* 0xd8-0xdf */
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xe0-0xe7 */
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xef, /* 0xe8-0xef */
          0xf0, 0xf1, 0x00, 0x00, 0x00, 0xf5, 0x00, 0xf7, /* 0xf0-0xf7 */
          0xf8, 0xf9, 0x00, 0x00, 0x00, 0x00, 0xfe, 0xff, /* 0xf8-0xff */
          };


          static void inc_use_count(void)
          {
          MOD_INC_USE_COUNT;
          }

          static void dec_use_count(void)
          {
          MOD_DEC_USE_COUNT;
          }

          static struct nls_table table = {
          "cp950",
          page_uni2charset,
          charset2uni,
          inc_use_count,
          dec_use_count,
          NULL
          };

          int init_nls_cp950(void)
          {
          return register_nls(&table);
          }

          #ifdef MODULE
          int init_module(void)
          {
          return init_nls_cp950();
          }


          void cleanup_module(void)
          {
          unregister_nls(&table);
          return;
          }
          #endif

          /*
          * Overrides for Emacs so that we follow Linus's tabbing style.
          * Emacs will notice this stuff at the end of the file and automatically
          * adjust the settings for this buffer only. This must remain at the end
          * of the file.
          *
          ---------------------------------------------------------------------------
          * Local variables:
          * c-indent-level: 8
          * c-brace-imaginary-offset: 0
          * c-brace-offset: -8
          * c-argdecl-indent: 8
          * c-label-offset: -8
          * c-continued-statement-offset: 8
          * c-continued-brace-offset: 0
          * End:
          */
          EOF
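The charset2uni array generated by this script is laid out row by row: one row per lead byte A1..F9, each row holding the trail bytes 40..7F followed by A0..FF, which gives the (0x100-0x60) = 160 slots per row in the array size. The index of a given Big5 code in that layout could be computed as in the sketch below. The function name big5_table_index is hypothetical, and the formula is derived from the loop order in the script, not from kernel code.

```c
#include <assert.h>

#define BIG5_LEAD_MIN 0xA1
#define BIG5_LEAD_MAX 0xF9
#define BIG5_ROW      (0x100 - 0x60)   /* 64 + 96 = 160 slots per lead byte */

/* Return the position of (lead, trail) in the generated charset2uni array,
 * or -1 if the pair lies outside the ranges the generator emits. */
long big5_table_index(unsigned lead, unsigned trail)
{
    unsigned col;

    assert(BIG5_ROW == 160);
    if (lead < BIG5_LEAD_MIN || lead > BIG5_LEAD_MAX)
        return -1;
    if (trail >= 0x40 && trail <= 0x7F)
        col = trail - 0x40;            /* first 64 slots of the row */
    else if (trail >= 0xA0 && trail <= 0xFF)
        col = 64 + (trail - 0xA0);     /* remaining 96 slots */
    else
        return -1;
    return (long)(lead - BIG5_LEAD_MIN) * BIG5_ROW + col;
}
```

The first entry, A1 40, maps to index 0, and the last, F9 FF, to index 14239, one less than the array size of 89 * 160 = 14240.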

          5.4 uni2gbk.pl

          #!/usr/bin/perl

          @code = (
          "00", "01", "02", "03", "04", "05", "06", "07",
          "08", "09", "0A", "0B", "0C", "0D", "0E", "0F",
          "10", "11", "12", "13", "14", "15", "16", "17",
          "18", "19", "1A", "1B", "1C", "1D", "1E", "1F",
          "20", "21", "22", "23", "24", "25", "26", "27",
          "28", "29", "2A", "2B", "2C", "2D", "2E", "2F",
          "30", "31", "32", "33", "34", "35", "36", "37",
          "38", "39", "3A", "3B", "3C", "3D", "3E", "3F",
          "40", "41", "42", "43", "44", "45", "46", "47",
          "48", "49", "4A", "4B", "4C", "4D", "4E", "4F",
          "50", "51", "52", "53", "54", "55", "56", "57",
          "58", "59", "5A", "5B", "5C", "5D", "5E", "5F",
          "60", "61", "62", "63", "64", "65", "66", "67",
          "68", "69", "6A", "6B", "6C", "6D", "6E", "6F",
          "70", "71", "72", "73", "74", "75", "76", "77",
          "78", "79", "7A", "7B", "7C", "7D", "7E", "7F",
          "80", "81", "82", "83", "84", "85", "86", "87",
          "88", "89", "8A", "8B", "8C", "8D", "8E", "8F",
          "90", "91", "92", "93", "94", "95", "96", "97",
          "98", "99", "9A", "9B", "9C", "9D", "9E", "9F",
          "A0", "A1", "A2", "A3", "A4", "A5", "A6", "A7",
          "A8", "A9", "AA", "AB", "AC", "AD", "AE", "AF",
          "B0", "B1", "B2", "B3", "B4", "B5", "B6", "B7",
          "B8", "B9", "BA", "BB", "BC", "BD", "BE", "BF",
          "C0", "C1", "C2", "C3", "C4", "C5", "C6", "C7",
          "C8", "C9", "CA", "CB", "CC", "CD", "CE", "CF",
          "D0", "D1", "D2", "D3", "D4", "D5", "D6", "D7",
          "D8", "D9", "DA", "DB", "DC", "DD", "DE", "DF",
          "E0", "E1", "E2", "E3", "E4", "E5", "E6", "E7",
          "E8", "E9", "EA", "EB", "EC", "ED", "EE", "EF",
          "F0", "F1", "F2", "F3", "F4", "F5", "F6", "F7",
          "F8", "F9", "FA", "FB", "FC", "FD", "FE", "FF");

          while (<STDIN>){
          ($unicode, $big5) = split;
          ($high, $low) = $unicode =~ /(..)(..)/;
          $table2{$high}{$low} = $big5;
          ($high, $low) = $big5 =~ /(..)(..)/;
          $table{$high}{$low} = $unicode;
          }

          print <<EOF;
          /*
          * linux/fs/nls_cp936.c
          *
          * Charset cp936 translation tables.
          * Generated automatically from the Unicode and charset
          * tables from the Unicode Organization (www.unicode.org).
          * The Unicode to charset table has only exact mappings.
          */

          #include <linux/module.h>
          #include <linux/kernel.h>
          #include <linux/string.h>
          #include <linux/nls.h>

          /* 81 - FE*/
          static struct nls_unicode charset2uni[(0xFE-0x81+1)*(0x100-0x40)] = {
          EOF

          for ($high=0x81; $high <= 0xFE; $high++){
          for ($low=0x40; $low <= 0x7F; $low++){
          $unicode = $table2{$code[$high]}{$code[$low]};
          $unicode = "0000" if (!(defined $unicode));
          print "\n\t" if ($low%4 == 0);
          print "/* $code[$high]$code[$low]*/\n\t" if ($low%0x10 == 0);
          ($uhigh, $ulow) = $unicode =~ /(..)(..)/;
          printf("{0x%2s, 0x%2s}, ", $ulow, $uhigh);
          }
          for ($low=0x80; $low <= 0xFF; $low++){
          $unicode = $table2{$code[$high]}{$code[$low]};
          $unicode = "0000" if (!(defined $unicode));
          print "\n\t" if ($low%4 == 0);
          print "/* $code[$high]$code[$low]*/\n\t" if ($low%0x10 == 0);
          ($uhigh, $ulow) = $unicode =~ /(..)(..)/;
          printf("{0x%2s, 0x%2s}, ", $ulow, $uhigh);
          }
          }

          print "\n};\n\n";
          for ($high=1; $high <= 255;$high++){
          if (defined $table{$code[$high]}){
          print "static unsigned char page$code[$high]\[512\] = {\n\t";
          for ($low=0; $low<=255;$low++){
          $big5 = $table{$code[$high]}{$code[$low]};
          $big5 = "3F3F" if (!(defined $big5));
          if ($low > 0 && $low%4 == 0){
          printf("/* 0x%02X-0x%02X */\n\t", $low-4, $low-1);
          }
          print "\n\t" if ($low == 0x80);
          ($bhigh, $blow) = $big5 =~ /(..)(..)/;
          printf("0x%2s, 0x%2s, ", $bhigh, $blow);
          }
          print "/* 0xFC-0xFF */\n};\n\n";
          }
          }

          print "static unsigned char *page_uni2charset[256] = {";
          for ($high=0; $high<=255;$high++){
          print "\n\t" if ($high%8 == 0);
          if ($high>0 && defined $table{$code[$high]}){
          print "page$code[$high], ";
          }
          else{
          print "NULL, ";
          }
          }
          print <<EOF;

          };

          static unsigned char charset2upper[256] = {
          0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* 0x00-0x07 */
          0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, /* 0x08-0x0f */
          0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, /* 0x10-0x17 */
          0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f, /* 0x18-0x1f */
          0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, /* 0x20-0x27 */
          0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f, /* 0x28-0x2f */
          0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, /* 0x30-0x37 */
          0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f, /* 0x38-0x3f */
          0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, /* 0x40-0x47 */
          0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, /* 0x48-0x4f */
          0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, /* 0x50-0x57 */
          0x58, 0x59, 0x5a, 0x5b, 0x5c, 0x5d, 0x5e, 0x5f, /* 0x58-0x5f */
          0x60, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x60-0x67 */
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x68-0x6f */
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x70-0x77 */
          0x00, 0x00, 0x00, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f, /* 0x78-0x7f */
          0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, /* 0x80-0x87 */
          0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f, /* 0x88-0x8f */
          0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, /* 0x90-0x97 */
          0x98, 0x99, 0x9a, 0x00, 0x9c, 0x00, 0x00, 0x00, /* 0x98-0x9f */
          0x00, 0x00, 0x00, 0x00, 0xa4, 0xa5, 0xa6, 0xa7, /* 0xa0-0xa7 */
          0xa8, 0xa9, 0xaa, 0xab, 0xac, 0xad, 0xae, 0xaf, /* 0xa8-0xaf */
          0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0xb5, 0xb6, 0xb7, /* 0xb0-0xb7 */
          0xb8, 0xb9, 0xba, 0xbb, 0xbc, 0xbd, 0xbe, 0xbf, /* 0xb8-0xbf */
          0xc0, 0xc1, 0xc2, 0xc3, 0xc4, 0xc5, 0xc6, 0xc7, /* 0xc0-0xc7 */
          0xc8, 0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf, /* 0xc8-0xcf */
          0xd0, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0x00, 0x00, /* 0xd0-0xd7 */
          0x00, 0xd9, 0xda, 0xdb, 0xdc, 0x00, 0x00, 0xdf, /* 0xd8-0xdf */
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xe0-0xe7 */
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xef, /* 0xe8-0xef */
          0xf0, 0xf1, 0x00, 0x00, 0x00, 0xf5, 0x00, 0xf7, /* 0xf0-0xf7 */
          0xf8, 0xf9, 0x00, 0x00, 0x00, 0x00, 0xfe, 0xff, /* 0xf8-0xff */
          };


          static void inc_use_count(void)
          {
          MOD_INC_USE_COUNT;
          }

          static void dec_use_count(void)
          {
          MOD_DEC_USE_COUNT;
          }

          static struct nls_table table = {
          "cp936",
          page_uni2charset,
          charset2uni,
          inc_use_count,
          dec_use_count,
          NULL
          };

          int init_nls_cp936(void)
          {
          return register_nls(&table);
          }

          #ifdef MODULE
          int init_module(void)
          {
          return init_nls_cp936();
          }


          void cleanup_module(void)
          {
          unregister_nls(&table);
          return;
          }
          #endif

          /*
          * Overrides for Emacs so that we follow Linus's tabbing style.
          * Emacs will notice this stuff at the end of the file and automatically
          * adjust the settings for this buffer only. This must remain at the end
          * of the file.
          *
          ---------------------------------------------------------------------------
          * Local variables:
          * c-indent-level: 8
          * c-brace-imaginary-offset: 0
          * c-brace-offset: -8
          * c-argdecl-indent: 8
          * c-label-offset: -8
          * c-continued-statement-offset: 8
          * c-continued-brace-offset: 0
          * End:
          */
          EOF
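The reverse direction uses the two-level table the script prints: page_uni2charset is indexed by the high byte of the Unicode value and points to a 512-byte page holding a two-byte charset code for each low byte, with 3F 3F marking an unmapped slot. A minimal sketch of a lookup against that layout follows. The function uni_to_charset and the one-entry demo page are hypothetical; a real table would come from the generated file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Demo tables in the generated layout; only U+4E02 is filled in. */
static unsigned char page4E[512];
static unsigned char *page_uni2charset[256];

void demo_tables_init(void)
{
    memset(page4E, 0x3F, sizeof page4E);   /* 3F 3F = unmapped */
    page4E[2 * 0x02]     = 0x81;           /* sample entry: U+4E02 -> 81 40 */
    page4E[2 * 0x02 + 1] = 0x40;
    page_uni2charset[0x4E] = page4E;
}

/* Look up a 16-bit Unicode value. Returns 2 and stores the double-byte
 * code in out, or 0 when no page exists or the slot is unmapped. */
int uni_to_charset(uint16_t wc, unsigned char out[2])
{
    const unsigned char *page = page_uni2charset[wc >> 8];

    assert(out != NULL);
    if (page == NULL)
        return 0;
    out[0] = page[2 * (wc & 0xFF)];
    out[1] = page[2 * (wc & 0xFF) + 1];
    if (out[0] == 0x3F && out[1] == 0x3F)
        return 0;
    return 2;
}
```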

          5.5 A tool for converting codepages

          /*
          * CPI.C: A program to examine MSDOS codepage files (*.cpi)
          * and extract specific codepages.
          * Compiles under Linux & DOS (using BC++ 3.1).
          *
          * Compile: gcc -o cpi cpi.c
          * Call: codepage file.cpi [-a|-l|nnn]
          *
          * Author: Ahmed M. Naas (ahmed@oea.xs4all.nl)
          * Many changes: aeb@cwi.nl [changed until it would handle all
          * *.cpi files people have sent me; I have no documentation,
          * so all this is experimental]
          * Remains to do: DRDOS fonts.
          *
          * Copyright: Public domain.
          */

          #include <stdio.h>
          #include <stdlib.h>
          #include <string.h>
          #include <unistd.h>

          int handle_codepage(int);
          void handle_fontfile(void);

          #define PACKED __attribute__ ((packed))
          /* Use this (instead of the above) to compile under MSDOS */
          /*#define PACKED */

          struct {
          unsigned char id[8] PACKED;
          unsigned char res[8] PACKED;
          unsigned short num_pointers PACKED;
          unsigned char p_type PACKED;
          unsigned long offset PACKED;
          } FontFileHeader;

          struct {
          unsigned short num_codepages PACKED;
          } FontInfoHeader;

          struct {
          unsigned short size PACKED;
          unsigned long off_nexthdr PACKED;
          unsigned short device_type PACKED; /* screen=1; printer=2 */
          unsigned char device_name[8] PACKED;
          unsigned short codepage PACKED;
          unsigned char res[6] PACKED;
          unsigned long off_font PACKED;
          } CPEntryHeader;

          struct {
          unsigned short reserved PACKED;
          unsigned short num_fonts PACKED;
          unsigned short size PACKED;
          } CPInfoHeader;

          struct {
          unsigned char height PACKED;
          unsigned char width PACKED;
          unsigned short reserved PACKED;
          unsigned short num_chard PACKED;
          } ScreenFontHeader;

          struct {
          unsigned short p1 PACKED;
          unsigned short p2 PACKED;
          } PrinterFontHeader;

          FILE *in, *out;
          void usage(void);

          int opta, optc, optl, optL, optx;
          extern int optind;
          extern char *optarg;

          unsigned short codepage;

          int main (int argc, char *argv[])
          {
          if (argc < 2)
          usage();

          if ((in = fopen(argv[1], "r")) == NULL) {
          printf("\nUnable to open file %s.\n", argv[1]);
          exit(0);
          }

          opta = optc = optl = optL = optx = 0;
          optind = 2;
          if (argc == 2)
          optl = 1;
          else
          while(1) {
          switch(getopt(argc, argv, "alLc")) {
          case 'a':
          opta = 1;
          continue;
          case 'c':
          optc = 1;
          continue;
          case 'L':
          optL = 1;
          continue;
          case 'l':
          optl = 1;
          continue;
          case '?':
          default:
          usage();
          case -1:
          break;
          }
          break;
          }
          if (optind != argc) {
          if (optind != argc-1 || opta)
          usage();
          codepage = atoi(argv[optind]);
          optx = 1;
          }

          if (optc)
          handle_codepage(0);
          else
          handle_fontfile();

          if (optx) {
          printf("no page %d found\n", codepage);
          exit(1);
          }

          fclose(in);
          return (0);
          }

          void
          handle_fontfile(){
          int i, j;

          j = fread(&FontFileHeader, 1, sizeof(FontFileHeader), in);
          if (j != sizeof(FontFileHeader)) {
          printf("error reading FontFileHeader - got %d chars\n", j);
          exit (1);
          }
          if (!strcmp(FontFileHeader.id + 1, "DRFONT ")) {
          printf("this program cannot handle DRDOS font files\n");
          exit(1);
          }
          if (optL)
          printf("FontFileHeader: id=%8.8s res=%8.8s num=%d typ=%c offset=%ld\n\n",
          FontFileHeader.id, FontFileHeader.res,
          FontFileHeader.num_pointers,
          FontFileHeader.p_type,
          FontFileHeader.offset);

          j = fread(&FontInfoHeader, 1, sizeof(FontInfoHeader), in);
          if (j != sizeof(FontInfoHeader)) {
          printf("error reading FontInfoHeader - got %d chars\n", j);
          exit (1);
          }
          if (optL)
          printf("FontInfoHeader: num_codepages=%d\n\n",
          FontInfoHeader.num_codepages);

          for (i = FontInfoHeader.num_codepages; i; i--)
          if (handle_codepage(i-1))
          break;
          }

          int
          handle_codepage(int more_to_come) {
          int j;
          char outfile[20];
          unsigned char *fonts;
          long inpos, nexthdr;

          j = fread(&CPEntryHeader, 1, sizeof(CPEntryHeader), in);
          if (j != sizeof(CPEntryHeader)) {
          printf("error reading CPEntryHeader - got %d chars\n", j);
          exit(1);
          }
          if (optL) {
          int t = CPEntryHeader.device_type;
          printf("CPEntryHeader: size=%d dev=%d [%s] name=%8.8s \
          codepage=%d\n\t\tres=%6.6s nxt=%ld off_font=%ld\n\n",
          CPEntryHeader.size,
          t, (t==1) ? "screen" : (t==2) ? "printer" : "?",
          CPEntryHeader.device_name,
          CPEntryHeader.codepage,
          CPEntryHeader.res,
          CPEntryHeader.off_nexthdr, CPEntryHeader.off_font);
          } else if (optl) {
          printf("\nCodepage = %d\n", CPEntryHeader.codepage);
          printf("Device = %.8s\n", CPEntryHeader.device_name);
          }
          #if 0
          if (CPEntryHeader.size != sizeof(CPEntryHeader)) {
          /* seen 26 and 28, so that the difference below is -2 or 0 */
          if (optl)
          printf("Skipping %d bytes of garbage\n",
          CPEntryHeader.size - sizeof(CPEntryHeader));
          fseek(in, CPEntryHeader.size - sizeof(CPEntryHeader),
          SEEK_CUR);
          }
          #endif
          if (!opta && (!optx || CPEntryHeader.codepage != codepage) && !optc)
          goto next;

          inpos = ftell(in);
          if (inpos != CPEntryHeader.off_font && !optc) {
          if (optL)
          printf("pos=%ld font at %ld\n", inpos, CPEntryHeader.off_font);
          fseek(in, CPEntryHeader.off_font, SEEK_SET);
          }

          j = fread(&CPInfoHeader, 1, sizeof(CPInfoHeader), in);
          if (j != sizeof(CPInfoHeader)) {
          printf("error reading CPInfoHeader - got %d chars\n", j);
          exit(1);
          }
          if (optl) {
          printf("Number of Fonts = %d\n", CPInfoHeader.num_fonts);
          printf("Size of Bitmap = %d\n", CPInfoHeader.size);
          }
          if (CPInfoHeader.num_fonts == 0)
          goto next;
          if (optc)
          return 0;

          sprintf(outfile, "%d.cp", CPEntryHeader.codepage);
          if ((out = fopen(outfile, "w")) == NULL) {
          printf("\nUnable to open file %s.\n", outfile);
          exit(1);
          } else printf("\nWriting %s\n", outfile);

          fonts = (unsigned char *) malloc(CPInfoHeader.size);

          fread(fonts, CPInfoHeader.size, 1, in);
          fwrite(&CPEntryHeader, sizeof(CPEntryHeader), 1, out);
          fwrite(&CPInfoHeader, sizeof(CPInfoHeader), 1, out);
          j = fwrite(fonts, 1, CPInfoHeader.size, out);
          if (j != CPInfoHeader.size) {
          printf("error writing %s - wrote %d chars\n", outfile, j);
          exit(1);
          }
          fclose(out);
          free(fonts);
          if (optx) exit(0);
          next:
          /*
          * It seems that if entry headers and fonts are interspersed,
          * then nexthdr will point past the font, regardless of
          * whether more entries follow.
          * Otherwise, first all entry headers are given, and then
          * all fonts; in this case nexthdr will be 0 in the last entry.
          */
          nexthdr = CPEntryHeader.off_nexthdr;
          if (nexthdr == 0 || nexthdr == -1) {
          if (more_to_come) {
          printf("more codepages expected, but nexthdr=%ld\n",
          nexthdr);
          exit(1);
          } else
          return 1;
          }

          inpos = ftell(in);
          if (inpos != CPEntryHeader.off_nexthdr) {
          if (optL)
          printf("pos=%ld nexthdr at %ld\n", inpos, nexthdr);
          if (opta && !more_to_come) {
          printf("no more code pages, but nexthdr != 0\n");
          return 1;
          }

          fseek(in, CPEntryHeader.off_nexthdr, SEEK_SET);
          }

          return 0;
          }

          void usage(void)
          {
          printf("\nUsage: cpi code_page_file [-c] [-L] [-l] [-a|nnn]\n");
          printf(" -c: input file is a single codepage\n");
          printf(" -L: print header info (you don't want to see this)\n");
          printf(" -l or no option: list all codepages contained in the file\n");
          printf(" -a: extract all codepages from the file\n");
          printf(" nnn (3 digits): extract codepage nnn from the file\n");
          printf("Example: cpi ega.cpi 850 \n");
          printf(" will create a file 850.cp containing the requested codepage.\n\n");
          exit(1);
          }
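
The listing reads each header by calling fread straight into a C struct, which only works when the compiler's field sizes and padding happen to match the file. As a minimal sketch, the entry header could instead be decoded byte by byte. It assumes the 28-byte little-endian layout implied by the fields printed above (size, next-header offset, device type, 8-byte device name, codepage, 6 reserved bytes, font offset). The names cp_entry, le16, le32 and decode_cp_entry are hypothetical and not part of the original program.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory mirror of one codepage entry header.
   Field names follow the listing; the widths are an assumption
   based on the 28-byte on-disk layout. */
struct cp_entry {
    uint16_t size;
    uint32_t off_nexthdr;
    uint16_t device_type;     /* 1 = screen, 2 = printer */
    char     device_name[9];  /* 8 bytes on disk plus a terminating NUL */
    uint16_t codepage;
    uint32_t off_font;
};

/* Read a 16-bit little-endian value from a byte buffer. */
static uint16_t le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a 32-bit little-endian value from a byte buffer. */
static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the assumed 28-byte header without relying on struct padding
   or on the host's sizeof(long). */
static void decode_cp_entry(const unsigned char buf[28], struct cp_entry *e)
{
    e->size        = le16(buf);
    e->off_nexthdr = le32(buf + 2);
    e->device_type = le16(buf + 6);
    memcpy(e->device_name, buf + 8, 8);
    e->device_name[8] = '\0';
    e->codepage    = le16(buf + 16);
    /* bytes 18..23 are reserved and skipped */
    e->off_font    = le32(buf + 24);
}
```

A caller would fread 28 bytes into an unsigned char buffer and pass it to decode_cp_entry. This avoids the problem that long is 8 bytes on most 64-bit Linux systems, where the struct would no longer match the file.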
          Code Page   Character Set     Language

          708         ASMO-708          Arabic (ASMO 708)
          720         DOS-720           Arabic (DOS)
          28596       iso-8859-6        Arabic (ISO)
          1256        windows-1256      Arabic (Windows)
          1257        windows-1257      Baltic (Windows)
          852         ibm852            Central European (DOS)
          28592       iso-8859-2        Central European (ISO)
          1250        windows-1250      Central European (Windows)
          936         gb2312            Simplified Chinese (GB2312)
          950         big5              Traditional Chinese (Big5)
          862         DOS-862           Hebrew (DOS)
          866         cp866             Cyrillic (DOS)
          874         windows-874       Thai (Windows)
          932         shift_jis         Japanese (Shift-JIS)
          949         ks_c_5601-1987    Korean
          1251        windows-1251      Cyrillic (Windows)
          1252        windows-1252      Western European
          1253        windows-1253      Greek (Windows)
          1254        windows-1254      Turkish (Windows)
          1255        windows-1255      Hebrew (Windows)
          1258        windows-1258      Vietnamese (Windows)
          20866       koi8-r            Cyrillic (KOI8-R)
          21866       koi8-u            Cyrillic (KOI8-U)
          28595       iso-8859-5        Cyrillic (ISO)
          28597       iso-8859-7        Greek (ISO)
          28598       iso-8859-8        Hebrew (ISO-Visual)
          38598       iso-8859-8-i      Hebrew (ISO-Logical)
          50932       _autodetect       Japanese (auto-select)
          51932       euc-jp            Japanese (EUC)
          52936       hz-gb-2312        Simplified Chinese (HZ)
          65001       utf-8             Unicode (UTF-8)
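
A table like this is typically used as a lookup from a numeric code page to a charset name. The following is a minimal sketch of such a lookup in C, using only a few rows from the table. The names cp_name, cp_names and charset_for_codepage are hypothetical and do not come from any library.

```c
#include <stddef.h>

/* One row of the table: code page number and its charset name. */
struct cp_name {
    int codepage;
    const char *charset;
};

/* A small excerpt of the table above; extend as needed. */
static const struct cp_name cp_names[] = {
    {   932, "shift_jis"      },
    {   936, "gb2312"         },
    {   949, "ks_c_5601-1987" },
    {   950, "big5"           },
    {  1251, "windows-1251"   },
    { 65001, "utf-8"          },
};

/* Return the charset name for a code page, or NULL if it is not listed. */
const char *charset_for_codepage(int codepage)
{
    size_t i;
    for (i = 0; i < sizeof cp_names / sizeof cp_names[0]; i++) {
        if (cp_names[i].codepage == codepage)
            return cp_names[i].charset;
    }
    return NULL;
}
```

A linear scan is enough for a few dozen entries. The returned name could then be passed to a conversion routine such as iconv_open, although the exact charset spellings accepted there vary by platform.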
          Posted on 2007-12-04 09:57 by flyepp
