Running Google's open-source word2vec project on Chinese data

http://www.cnblogs.com/hebin/p/3507609.html

I had long heard that word2vec works very well for measuring similarity between words, so I recently tried running Google's open-source code (https://code.google.com/p/word2vec/) myself.

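1. Corpus

First, prepare the data. I used the whole-web news dataset (SogouCA) recommended in blog posts online; it is 2.1 GB.

Download the archive SogouCA.tar.gz over FTP: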

wget ftp://ftp.labs.sogou.com/Data/SogouCA/SogouCA.tar.gz --ftp-user=hebin_hit@foxmail.com --ftp-password=4FqLSYdNcrDXvNDi -r

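Unpack the archive:

gzip -d SogouCA.tar.gz
tar -xvf SogouCA.tar

Then merge the extracted txt files into SogouCA.txt, convert the text from GBK to UTF-8, and keep only the lines that contain <content>. The result is the corpus file corpus.txt, 2.7 GB in size.

cat *.txt > SogouCA.txt
cat SogouCA.txt | iconv -f gbk -t utf-8 -c | grep "<content>" > corpus.txt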

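2. Word segmentation

Segment corpus.txt with ANSJ to get the segmented file resultbig.txt, 3.1 GB in size.

For the ANSJ segmentation tool, see http://blog.csdn.net/zhaoxinfan/article/details/10403917

In the seg_tool directory of the segmentation tool, compile first and then run it to produce resultbig.txt, which contains 426,221 distinct words and 572,308,385 word occurrences in total.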
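The two figures above can be sanity-checked straight from the segmented file. The following is a minimal sketch, assuming tokens are separated by spaces or tabs; the function name corpus_stats and the sample data are hypothetical. Counts obtained this way may differ from the numbers word2vec prints, since word2vec drops rare words (its -min-count option) and counts line ends as a </s> token.

```shell
# Count distinct words and total tokens in a whitespace-segmented text file.
corpus_stats() {  # usage: corpus_stats SEGMENTED_FILE
  # Put one token per line, then count unique and total non-empty lines.
  tr -s ' \t' '\n\n' < "$1" |
    awk 'NF { total++; if (!seen[$0]++) vocab++ }
         END { printf "%d %d\n", vocab, total }'
}

# Tiny hypothetical sample standing in for resultbig.txt.
sample=$(mktemp)
printf '北京 上海 北京\n天津 北京 上海\n' > "$sample"
corpus_stats "$sample"   # prints "<distinct words> <total tokens>"
```

On the real corpus the call would be corpus_stats resultbig.txt; awk keeps the whole vocabulary in memory, which should be manageable for a few hundred thousand words.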

[Figure: segmentation result]

3. Training word vectors with the word2vec tool

nohup ./word2vec -train resultbig.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1 &

vectors.bin is the word-vector file that word2vec produces from resultbig.txt. Training took an hour and a half on our lab server.
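A quick way to confirm that training finished sensibly is to read the first line of the output file, which word2vec writes as plain text ("vocabulary size, vector size") even in binary mode. This is a small sketch; the function name vector_header and the demo file are hypothetical.

```shell
# Print the "vocab_size vector_size" header of a word2vec output file.
vector_header() {  # usage: vector_header VECTORS_FILE
  head -n 1 "$1"
}

# Hypothetical stand-in for a real vectors file: 3 words, 2 dimensions.
demo=$(mktemp)
printf '3 2\nfoo 0.1 0.2\nbar 0.3 0.4\nbaz 0.5 0.6\n' > "$demo"
vector_header "$demo"
```

With the training command above, the second number should match the -size value, i.e. 200.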

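4. Analysis

4.1 Finding similar words:

./distance vectors.bin

./distance finds the words closest to a query word. Each word is treated as a point in the vector space, and the tool ranks the other words by how close their points are to it, measured by cosine similarity.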
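To make the similarity measure concrete, here is a minimal sketch that computes the cosine similarity of two words with awk. It assumes the vectors were saved in text form (-binary 0) rather than the binary file used above; the function name cosine and the toy two-dimensional vectors are hypothetical.

```shell
# Cosine similarity between two words in a text-format word2vec file.
cosine() {  # usage: cosine VECTORS_TXT WORD1 WORD2
  awk -v a="$2" -v b="$3" '
    NR == 1 { next }   # skip the "vocab_size vector_size" header
    $1 == a { for (i = 2; i <= NF; i++) va[i] = $i; fa = 1 }
    $1 == b { for (i = 2; i <= NF; i++) vb[i] = $i; fb = 1; n = NF }
    END {
      if (!fa || !fb) { print "word not found" > "/dev/stderr"; exit 1 }
      for (i = 2; i <= n; i++) {
        dot += va[i] * vb[i]; na += va[i]^2; nb += vb[i]^2
      }
      printf "%.4f\n", dot / (sqrt(na) * sqrt(nb))
    }' "$1"
}

# Toy vectors (made up for illustration, not real training output).
vec=$(mktemp)
printf '3 2\n北京 1 0\n上海 0.6 0.8\n天津 0 1\n' > "$vec"
cosine "$vec" 北京 上海
```

A value near 1 means the two words point in almost the same direction; a value near 0 means they are unrelated in this space.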

[Figure: example output from ./distance]

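4.2 Latent linguistic regularities

After modifying demo-analogy.sh, I obtained examples such as the following:

The capital of France is Paris and the capital of the UK is London: vector("巴黎") - vector("法國") + vector("英國") --> vector("倫敦")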
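The vector arithmetic behind such an analogy can be sketched in awk. This version assumes text-format vectors (-binary 0), reads the file twice so that it never holds the whole vocabulary in memory, and works on the raw vectors, whereas the bundled analogy tool, as far as I can tell, normalises vectors to unit length first. The function name analogy and the toy vectors are hypothetical.

```shell
# Answer "A is to B as C is to ?" using vec(B) - vec(A) + vec(C).
analogy() {  # usage: analogy VECTORS_TXT A B C
  awk -v a="$2" -v b="$3" -v c="$4" '
    FNR == 1 { pass++; next }   # header line; also marks the start of each pass
    pass == 1 {                 # pass 1: build the target vector t
      if ($1 == a) for (i = 2; i <= NF; i++) t[i] -= $i
      if ($1 == b) for (i = 2; i <= NF; i++) t[i] += $i
      if ($1 == c) for (i = 2; i <= NF; i++) t[i] += $i
      next
    }
    $1 == a || $1 == b || $1 == c { next }   # pass 2: skip the query words
    {
      dot = 0; nw = 0; nt = 0
      for (i = 2; i <= NF; i++) { dot += $i * t[i]; nw += $i^2; nt += t[i]^2 }
      if (nw == 0 || nt == 0) next
      sim = dot / sqrt(nw * nt)   # cosine similarity to the target
      if (best == "" || sim > bestsim) { best = $1; bestsim = sim }
    }
    END { if (best != "") print best }' "$1" "$1"
}

# Toy vectors chosen so that capital = country + (0, 1); not real output.
toy=$(mktemp)
printf '5 2\n法國 1 0\n巴黎 1 1\n英國 2 0\n倫敦 2 1\n蘋果 -1 0.5\n' > "$toy"
analogy "$toy" 法國 巴黎 英國
```

On the toy data the target vector is (2, 1), which coincides with the made-up vector for 倫敦, so that word is returned.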

4.3 Clustering

Cluster the words in the segmented corpus resultbig.txt and sort them by cluster ID. Run the sort only after the background training job has finished:

nohup ./word2vec -train resultbig.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500 &
sort classes.txt -k 2 -n > classes_sorted_sogouca.txt
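Each line of classes.txt holds a word followed by its cluster ID, so individual clusters are easy to inspect with awk. The sketch below uses hypothetical helper names (cluster_members, cluster_sizes) and a made-up three-line sample.

```shell
# List the words assigned to one cluster.
cluster_members() {  # usage: cluster_members CLASSES_TXT CLASS_ID
  awk -v id="$2" '$2 == id { print $1 }' "$1"
}

# Print "class_id word_count" for every cluster, ordered by class ID.
cluster_sizes() {  # usage: cluster_sizes CLASSES_TXT
  awk '{ n[$2]++ } END { for (k in n) print k, n[k] }' "$1" | sort -n
}

# Hypothetical sample in the "word class_id" layout.
classes=$(mktemp)
printf '北京 0\n上海 0\n天津 1\n' > "$classes"
cluster_members "$classes" 0
cluster_sizes "$classes"
```

On the real output, cluster_members classes_sorted_sogouca.txt 42 would print every word in cluster 42 (the ID is arbitrary here).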

[Figure: example clusters]

4.4 Phrase analysis

First, use the segmented corpus resultbig.txt to produce sogouca_phrase.txt, a file containing both words and phrases. Then train vector representations for the words and phrases in that file.

./word2phrase -train resultbig.txt -output sogouca_phrase.txt -threshold 500 -debug 2
./word2vec -train sogouca_phrase.txt -output vectors_sogouca_phrase.bin -cbow 0 -size 300 -window 10 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1
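word2phrase marks a detected phrase by joining its two words with an underscore, so the phrases it found can be listed directly from the output file. This is a rough sketch; the function name top_phrases and the sample text are hypothetical, and any token that already contained an underscore in the corpus would be listed too.

```shell
# Show the most frequent underscore-joined tokens with their counts.
top_phrases() {  # usage: top_phrases PHRASE_FILE [N]
  tr -s ' \t' '\n\n' < "$1" |
    awk 'index($0, "_")' |       # keep only tokens containing "_"
    sort | uniq -c | sort -rn |  # count and order by frequency
    head -n "${2:-10}" |
    awk '{ print $2, $1 }'       # print "phrase count"
}

# Hypothetical sample standing in for sogouca_phrase.txt.
phrases=$(mktemp)
printf '上海_世博 今天\n上海_世博 中央_政府 今天\n' > "$phrases"
top_phrases "$phrases" 5
```

Looking at this list is a simple way to judge whether the -threshold value is too strict or too loose before training vectors on the phrase file.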

[Figure: similarity examples for words and phrases]

5. Reference links:

1. word2vec: Tool for computing continuous distributed representations of words, https://code.google.com/p/word2vec/

2. Playing with Google's open-source deep-learning project word2vec in Chinese, http://www.cnblogs.com/wowarsenal/p/3293586.html

3. Clustering keywords with word2vec, http://blog.csdn.net/zhaoxinfan/article/details/11069485

6. Papers I plan to read closely next:

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
[3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.

[4] Collobert R, Weston J, Bottou L, et al. Natural language processing (almost) from scratch[J]. The Journal of Machine Learning Research, 2011, 12: 2493-2537.

posted on 2016-01-13 13:49 SIMONE
