ivaneeo's blog

The power of freedom, a life of freedom.


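The hard-working litaocheng, despite a heavy daily load of overtime, keeps studying and writing, and keeps bringing us excellent articles. This one, "Erlang's Unicode Support", introduces the newest and most anticipated feature of R13: built-in Unicode support. Enough preamble; on to the article.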

In R13A, Erlang added support for Unicode. The data types covered in this article are list and binary; the modules covered are stdlib/unicode, stdlib/io, and kernel/file.

Binary

The type specifier of a binary segment gained three UTF-related types: utf8, utf16, and utf32, corresponding to the UTF-8, UTF-16, and UTF-32 encodings.

Binary Constructing

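When constructing a binary with one of the UTF types, the integer Value must fall in one of three ranges: 0..16#D7FF, 16#E000..16#FFFD, or 16#10000..16#10FFFF. Otherwise a 'bad argument' (badarg) error is raised. The same value produces a different binary depending on which UTF type is specified.

For utf8, each integer produces 1 to 4 bytes; for utf16, each integer produces 2 or 4 bytes; for utf32, each integer produces 4 bytes.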

For example, build a binary from the character whose Unicode code point is 1024 (U+0400):

1> <<1024/utf8>>.
<<208,128>>
2> <<1024/utf16>>.
<<4,0>>
3> <<1024/utf32>>.
<<0,0,4,0>>
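
The size rules and the badarg behavior described above can be checked with a small helper module. This is only a sketch: the module name utf_construct_demo and both function names are made up for illustration and are not part of OTP.

```erlang
-module(utf_construct_demo).
-export([encoded_sizes/1, try_encode/1]).

%% Hypothetical helper: number of bytes one code point occupies
%% in each of the three UTF encodings, as {Utf8, Utf16, Utf32}.
encoded_sizes(CodePoint) ->
    {byte_size(<<CodePoint/utf8>>),
     byte_size(<<CodePoint/utf16>>),
     byte_size(<<CodePoint/utf32>>)}.

%% Hypothetical helper: returns {ok, Binary} for a valid code point,
%% or {error, badarg} when the value lies outside the allowed ranges
%% (for example a surrogate such as 16#D800).
try_encode(CodePoint) ->
    try
        {ok, <<CodePoint/utf8>>}
    catch
        error:badarg -> {error, badarg}
    end.
```

For instance, utf_construct_demo:encoded_sizes(1024) returns {2,2,4}, and utf_construct_demo:try_encode(16#D800) returns {error,badarg}.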

Binary Match

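When matching a binary with one of the UTF types, a variable that matches successfully is bound to an integer in one of the three ranges 0..16#D7FF, 16#E000..16#FFFD, or 16#10000..16#10FFFF.

The number of bytes consumed by the match depends on the UTF type:

utf8 matches 1 to 4 bytes (see RFC 2279, since obsoleted by RFC 3629)
utf16 matches 2 or 4 bytes (see RFC 2781)
utf32 matches 4 bytes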

For example, continuing the session above:

4> Bin = <<1024/utf8>>.
<<208,128>>
5> <<U/utf8>> = Bin.
<<208,128>>
6> U.
1024

In this example, U matched 2 bytes. A unit specifier cannot be given for the UTF types.
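
Matching one utf8 segment at a time is enough to walk a whole binary. The sketch below assumes a UTF-8 input; the module utf_match_demo and the function code_points/1 are hypothetical names chosen for this illustration.

```erlang
-module(utf_match_demo).
-export([code_points/1]).

%% Hypothetical helper: decode a UTF-8 binary into a list of code points
%% by repeatedly matching a single utf8 segment off the front.
code_points(Bin) when is_binary(Bin) ->
    code_points(Bin, []).

%% Each successful match consumes 1 to 4 bytes, depending on the character.
code_points(<<C/utf8, Rest/binary>>, Acc) ->
    code_points(Rest, [C | Acc]);
code_points(<<>>, Acc) ->
    {ok, lists:reverse(Acc)};
%% Anything that does not start with a valid UTF-8 sequence ends up here.
code_points(Rest, Acc) ->
    {error, lists:reverse(Acc), Rest}.
```

Calling utf_match_demo:code_points(<<208,128,65>>) returns {ok,[1024,65]}: the first match consumes two bytes and the second consumes one.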

List

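In a list, each Unicode character is represented by an integer, so unlike a latin1 list, a Unicode list may contain elements greater than 255. The following is a valid Unicode list: [1024, 1025]

The unicode module converts between lists and binaries.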
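
As a minimal sketch of that conversion, the following round-trips a Unicode list through a UTF-8 binary. The module unicode_list_demo and the function round_trip/1 are hypothetical names; only the unicode module calls come from OTP.

```erlang
-module(unicode_list_demo).
-export([round_trip/1]).

%% Hypothetical helper: encode a Unicode list as a UTF-8 binary,
%% then decode it again, returning both results for comparison.
round_trip(List) ->
    Bin = unicode:characters_to_binary(List),
    {Bin, unicode:characters_to_list(Bin)}.
```

unicode_list_demo:round_trip([1024, 1025]) returns {<<208,128,208,129>>,[1024,1025]}.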

unicode module

First, take a look at the following type definitions:

unicode_binary() = binary() with characters encoded in UTF-8 coding standard
unicode_char() = integer() representing valid unicode codepoint
chardata() = charlist() | unicode_binary()
charlist() = [unicode_char() | unicode_binary() | charlist()]
a unicode_binary is allowed as the tail of the list

external_unicode_binary() = binary() with characters coded in a user specified Unicode encoding other than UTF-8 (UTF-16 or UTF-32)
external_chardata() = external_charlist() | external_unicode_binary()
external_charlist() = [unicode_char() | external_unicode_binary() | external_charlist()]
an external_unicode_binary is allowed as the tail of the list

latin1_binary() = binary() with characters coded in iso-latin-1
latin1_char() = integer() representing valid latin1 character (0-255)
latin1_chardata() = latin1_charlist() | latin1_binary()
latin1_charlist() = [latin1_char() | latin1_binary() | latin1_charlist()]
a latin1_binary is allowed as the tail of the list

unicode:characters_to_list converts chardata, latin1_chardata, or external_chardata() into a Unicode list.

If the input is latin1_chardata, the Data argument is simply an iodata. Every element of the resulting list is an integer. unicode:characters_to_list/1 is equivalent to unicode:characters_to_list(Data, unicode).

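If the data uses some other encoding, specify it as InEncoding with unicode:characters_to_list/2. On success the function returns the resulting list itself (not an {ok, List} tuple). On failure it returns {error, List, RestData}, where List is the part that was converted successfully and RestData is the remaining input, starting at the position of the error. If the input ends in the middle of a character, it returns {incomplete, List, RestBinary}.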
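
The return values of unicode:characters_to_list/2 are easy to mix up, because a successful call returns the bare list while problems are reported as tagged tuples. The sketch below wraps the call so that every outcome is tagged; the module unicode_decode_demo and the function decode/1 are hypothetical, and the input is assumed to be UTF-8.

```erlang
-module(unicode_decode_demo).
-export([decode/1]).

%% Hypothetical wrapper around unicode:characters_to_list/2 that tags
%% the success case as {ok, List}; the library itself returns the
%% bare list when the conversion succeeds.
decode(Data) ->
    case unicode:characters_to_list(Data, utf8) of
        {error, Converted, Rest} ->
            %% Rest starts at the first invalid byte.
            {error, Converted, Rest};
        {incomplete, Converted, Rest} ->
            %% The input ended in the middle of a multi-byte character.
            {incomplete, Converted, Rest};
        List when is_list(List) ->
            {ok, List}
    end.
```

For example, unicode_decode_demo:decode(<<208,128,255>>) returns {error,[1024],<<255>>}, and unicode_decode_demo:decode(<<208,128,208>>) returns {incomplete,[1024],<<208>>}.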

Similarly, unicode:characters_to_binary converts chardata, latin1_chardata, or external_chardata() into a binary. It works like unicode:characters_to_list, except that the result is a binary.

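If Data is latin1_chardata, unicode:characters_to_binary(Data, latin1, latin1) does the same job as erlang:iolist_to_binary/1. The one-argument form unicode:characters_to_binary/1 always produces UTF-8, so it gives the same result as erlang:iolist_to_binary/1 only when every character is 7-bit ASCII.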
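
The difference only shows up for characters above 127, so a side-by-side comparison may help. This is a sketch; the module latin1_demo and the function compare/1 are hypothetical names.

```erlang
-module(latin1_demo).
-export([compare/1]).

%% Hypothetical helper: convert the same latin1 data three ways and
%% return {IoListResult, Latin1ToLatin1, Latin1ToUtf8}.
compare(Latin1Data) ->
    {erlang:iolist_to_binary(Latin1Data),
     %% latin1 in, latin1 out: bytes are copied unchanged.
     unicode:characters_to_binary(Latin1Data, latin1, latin1),
     %% latin1 in, UTF-8 out: characters above 127 become two bytes.
     unicode:characters_to_binary(Latin1Data, latin1, utf8)}.
```

latin1_demo:compare([233]) returns {<<233>>,<<233>>,<<195,169>>}, while for pure ASCII input such as "abc" all three binaries are identical.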

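The unicode module also contains two BOM-related functions: unicode:bom_to_encoding/1 returns the encoding that corresponds to a BOM, and unicode:encoding_to_bom/1 generates the BOM for a given encoding. They are often used when saving files.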
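
A minimal sketch of how the two BOM functions might be combined when writing and reading data. The module bom_demo and the functions with_bom/2 and strip_bom/1 are hypothetical names; the sketch also assumes the character data is valid, since an error tuple from unicode:characters_to_binary/3 would make the binary construction fail.

```erlang
-module(bom_demo).
-export([with_bom/2, strip_bom/1]).

%% Hypothetical helper: encode CharData in Encoding and prepend the
%% matching byte order mark. Assumes CharData is valid Unicode data.
with_bom(Encoding, CharData) ->
    Bom = unicode:encoding_to_bom(Encoding),
    Body = unicode:characters_to_binary(CharData, unicode, Encoding),
    <<Bom/binary, Body/binary>>.

%% Hypothetical helper: detect the encoding from a leading BOM and
%% return it together with the data that follows the BOM.
%% Without a BOM, bom_to_encoding/1 reports {latin1, 0}.
strip_bom(Bin) ->
    {Encoding, BomLength} = unicode:bom_to_encoding(Bin),
    <<_:BomLength/binary, Body/binary>> = Bin,
    {Encoding, Body}.
```

bom_demo:with_bom(utf8, [1024]) returns <<239,187,191,208,128>>, and passing that binary to bom_demo:strip_bom/1 returns {utf8,<<208,128>>}.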

Examples

1. Reading a file saved as UTF-8

The file test.txt has the following content:

[

{desc, "这是一个测试文件"},

{author, "litaocheng"}

].

The content is an Erlang term, and the file is saved with UTF-8 encoding.
The code is as follows:

%% read content from the file
test1() ->
    {ok, [Terms]} = file:consult("test.txt"),
    Desc = proplists:get_value(desc, Terms),
    _Author = proplists:get_value(author, Terms),
   
    % output Desc as a UTF-8 binary and as a Unicode list
    DescUniBin = iolist_to_binary(Desc),
    DescUniList = unicode:characters_to_list(DescUniBin),
    io:format("desc bin : ~ts~ndesc bin : ~p~n",[DescUniBin,DescUniBin]),
    io:format("desc list: ~ts~ndesc list: ~p~n", [DescUniList,DescUniList]).

Result:

desc bin : 这是一个测试文件
desc bin : <<232,191,153,230,152,175,228,184,128,228,184,170,230,181,139,232,
             175,149,230,150,135,228,187,182>>
desc list: 这是一个测试文件
desc list: [36825,26159,19968,20010,27979,35797,25991,20214]

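First the content is converted from a list to a binary. Because file:consult/1 returns the string here as a list of raw bytes, DescUniBin is the corresponding Unicode (UTF-8) binary. It is then converted to a Unicode list with unicode:characters_to_list/1 and printed.
As the output shows, every element of the Unicode list is an integer, and the Unicode string inside the Unicode binary is encoded as utf8.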

2. Saving data in UTF-8 format

%% save the binary in utf8 format
test2() ->
    [DescList] = io_lib:format("~ts", ["这是一个测试文件"]),
    DescBin = erlang:iolist_to_binary(DescList),
    DescList2 = unicode:characters_to_list(DescBin),
    List = lists:concat(["[{desc,\"", DescList2, "\"}, {author,\"litaocheng\"}]."]),
    Bin = unicode:characters_to_binary(List),
    io:format("bin is:~ts~n", [Bin]),
    file:write_file("test_out.txt", Bin).

The complete code for this example (unicode_test.erl) is as follows:

-module(unicode_test).
-compile([export_all]).
 
%%
%% the test.txt content:
%% [
%% {desc, "这是一个测试文件"},
%% {author, "litaocheng"}
%% ].
%%
 
test() ->
    test2(),
    test1().
 
%% read content from the file
test1() ->
    {ok, [Term]} = file:consult("test_out.txt"),
    Desc = proplists:get_value(desc, Term),
    _Author = proplists:get_value(author, Term),
   
    % output Desc as a UTF-8 binary and as a Unicode list
    DescUniBin = iolist_to_binary(Desc),
    DescUniList = unicode:characters_to_list(DescUniBin),
    io:format("desc bin : ~ts~ndesc bin : ~p~n",[DescUniBin,DescUniBin]),
    io:format("desc list: ~ts~ndesc list: ~p~n", [DescUniList,DescUniList]).
 
 
%% save the binary in utf8 format
test2() ->
    [DescList] = io_lib:format("~ts", ["这是一个测试文件"]),
    DescBin = erlang:iolist_to_binary(DescList),
    DescList2 = unicode:characters_to_list(DescBin),
    List = lists:concat(["[{desc,\"", DescList2, "\"}, {author,\"litaocheng\"}]."]),
    Bin = unicode:characters_to_binary(List),
    io:format("bin is:~ts~n", [Bin]),
    file:write_file("test_out.txt", Bin).

posted on 2009-10-28 15:19 by ivaneeo. Category: erlang - distributed language
