Change Dir

先知cd: a love of life is where all art begins

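Notice

Everything written here is a resource shared with the internet; all posts are original essays.
Please credit the author, changedi, when reposting or quoting.
I enjoy applied research and love programming. Feel free to get in touch.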


Some tips on feature preprocessing in Weka

First, here are two links that contain the complete original material:
http://weka.wikispaces.com/Text+categorization+with+Weka
http://weka.wikispaces.com/ARFF+files+from+Text+Collections

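Weka can read a data set directly from a directory tree.
Below are a few things to watch for when using Weka to process text features.
One caveat up front: loading data from a directory is not available in the Weka GUI.
First, write the data to be processed into a directory structure like this: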

...
|
+- text_example
|
+- class1
|  |
|  + file1.txt
|  |
|  + file2.txt
|  |
|  ...
|
+- class2
|  |
|  + another_file1.txt
|  |
|  + another_file2.txt
|  |
|  ...
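
The layout above can also be produced programmatically. The following is a minimal sketch in plain Java, not part of Weka: the class `TextDirLayout` and its two helper methods are hypothetical names introduced only for illustration. It writes each document to `<root>/<label>/<file>` and lists the sub-directory names, which are what end up as class labels.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not a Weka class: builds the one-folder-per-class layout.
final class TextDirLayout {

    /** Writes one document as root/label/fileName, creating folders as needed. */
    static void addDocument(Path root, String label, String fileName, String text)
            throws IOException {
        Path classDir = root.resolve(label);
        Files.createDirectories(classDir);
        Files.write(classDir.resolve(fileName), text.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns the names of the sub-directories of root, sorted alphabetically. */
    static List<String> classLabels(Path root) throws IOException {
        List<String> labels = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(root)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    labels.add(entry.getFileName().toString());
                }
            }
        }
        Collections.sort(labels);
        return labels;
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for text_example here.
        Path root = Files.createTempDirectory("text_example");
        addDocument(root, "class1", "file1.txt", "the quick brown fox");
        addDocument(root, "class2", "another_file1.txt", "stock prices fell today");
        System.out.println(classLabels(root));
    }
}
```

Writing the files yourself like this keeps the labelling explicit: moving a file to another sub-directory is all it takes to relabel it.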

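Then, from the directory of the Weka source package, run the following on the command line:
java weka.core.converters.TextDirectoryLoader -dir text_example > text_example.arff
Here text_example is the directory holding the data, and text_example.arff is the ARFF file that gets generated. One more thing worth noting: in the resulting ARFF, the text content is stored as a single string attribute. In other words, the generated ARFF has just one feature attribute plus a class label, and the class label is the name of the classX sub-directory directly under text_example. To make this easier to work with, Weka provides an unsupervised attribute filter, StringToWordVector, that handles tokenization (here meaning a simple split of English text) and can also apply TF/IDF weighting.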
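
To make the tokenize-then-weight idea concrete, here is a small sketch in plain Java. It is not Weka's implementation: `TfIdfSketch` and its methods are hypothetical names, the tokenizer only handles ASCII letters, and the weight used is the textbook tf × ln(N / df), which may differ from the exact transform a given Weka version applies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of word-vector features; not Weka code.
final class TfIdfSketch {

    /** Lower-cases the text and splits it on runs of non-letter characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns one term-to-weight map per document, weight = tf * ln(N / df). */
    static List<Map<String, Double>> tfIdf(List<String> docs) {
        List<Map<String, Integer>> counts = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : tokenize(doc)) {
                tf.merge(term, 1, Integer::sum);      // raw term frequency
            }
            for (String term : tf.keySet()) {
                docFreq.merge(term, 1, Integer::sum); // documents containing the term
            }
            counts.add(tf);
        }
        List<Map<String, Double>> weights = new ArrayList<>();
        for (Map<String, Integer> tf : counts) {
            Map<String, Double> w = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                w.put(e.getKey(), e.getValue() * idf);
            }
            weights.add(w);
        }
        return weights;
    }
}
```

A term that occurs in every document gets weight 0 under this formula, which is the point of the IDF factor: words shared by all classes carry no discriminating information.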
The following simple code performs a complete classification run:

import weka.core.*;
import weka.core.converters.*;
import weka.classifiers.trees.*;
import weka.filters.*;
import weka.filters.unsupervised.attribute.*;

import java.io.*;

/**
 * Example class that converts text files stored in a directory structure into
 * an ARFF file using the TextDirectoryLoader converter. It then applies the
 * StringToWordVector to the data and feeds a J48 classifier with it.
 *
 * @author FracPete (fracpete at waikato dot ac dot nz)
 */
public class TextCategorizationTest {

    /**
     * Expects the text files in the directory ./text_example.
     * In that directory, each sub-directory represents a class and the text
     * files in these sub-directories will be labeled as such.
     *
     * @param args the commandline arguments (not used)
     * @throws Exception if something goes wrong
     */
    public static void main(String[] args) throws Exception {
        // convert the directory into a dataset
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("./text_example"));
        Instances dataRaw = loader.getDataSet();
        System.out.println("\n\nImported data, number of classes:\n\n" + dataRaw.numClasses());

        // apply the StringToWordVector
        // (see the source code of setOptions(String[]) method of the filter
        // if you want to know which command-line option corresponds to which
        // bean property)
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(dataRaw);
        Instances dataFiltered = Filter.useFilter(dataRaw, filter);
        System.out.println("\n\nFiltered data:\n\n" + dataFiltered);

        // train J48 and output model
        J48 classifier = new J48();
        classifier.buildClassifier(dataFiltered);
        System.out.println("\n\nClassifier model:\n\n" + classifier);
    }
}

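Finally, I still recommend writing your own programs for data modeling and data generation. Data preparation can usually be controlled precisely only by your own code; at most, Weka helps with selection and classification.
One more point: many people have asked me how to do text classification. If you would rather not read the papers, here is the basic idea. Whatever the classification task, classifiers are basically interchangeable (note the word "basically"). What matters is how you build the model and generate the features. As for the features used in text classification, there is TF*IDF, along with others such as mutual information, the chi-square statistic, and expected cross entropy. The formulas are all published, and computing them really is not hard. Of the classification problems I have worked on, feature computation for text classification is among the easiest.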
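
As an illustration of how little code one of these scores needs, here is a sketch of the chi-square statistic for a single term and a single class, computed from a 2×2 table of document counts. The class and method names are hypothetical, and the formula is the standard textbook one, not something taken from Weka: χ² = N·(AD − BC)² / ((A+B)(C+D)(A+C)(B+D)).

```java
// Hypothetical illustration of a chi-square term score; not Weka code.
final class ChiSquareSketch {

    /**
     * Chi-square score of a term for a class, from document counts:
     *   a = documents in the class that contain the term
     *   b = documents outside the class that contain the term
     *   c = documents in the class that lack the term
     *   d = documents outside the class that lack the term
     */
    static double chiSquare(long a, long b, long c, long d) {
        long n = a + b + c + d;
        double denom = (double) (a + b) * (c + d) * (a + c) * (b + d);
        if (denom == 0.0) {
            return 0.0; // a row or column is empty, so the term carries no signal
        }
        double diff = (double) a * d - (double) b * c;
        return n * diff * diff / denom;
    }

    public static void main(String[] args) {
        // Term present in 8 of 10 in-class documents and 2 of 10 others.
        System.out.println(chiSquare(8, 2, 2, 8));
    }
}
```

A higher score means the presence of the term depends more strongly on the class, so ranking terms by this value and keeping the top ones is a simple form of feature selection.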

posted on 2012-04-24 16:09 by changedi | Views (3920) | Comments (0) | Category: Machine Learning
