Change Dir

先知cd (Prophet cd): loving life is the beginning of all art

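Announcement

Everything written down is a resource, shared with the internet. All posts are original notes.
When reposting or quoting, please credit the author, changedi.
I enjoy applied research and love programming. Feel free to get in touch.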

Some tips on feature preprocessing in Weka

First, here are two links that contain the full original material:
http://weka.wikispaces.com/Text+categorization+with+Weka
http://weka.wikispaces.com/ARFF+files+from+Text+Collections

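Weka can read data in from a directory tree.
Here are a few things to watch for when using Weka to process text features.
One caveat up front: loading data from a directory is not available in the Weka GUI (at least in the version I used), so it has to be done from the command line or from code.
First, lay out the data to be processed in a directory structure like this: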

...
|
+- text_example
|
+- class1
|  |
|  + file1.txt
|  |
|  + file2.txt
|  |
|  ...
|
+- class2
|  |
|  + another_file1.txt
|  |
|  + another_file2.txt
|  |
|  ...

Then, from the source package (with the Weka classes on the classpath), run the following on the command line: java weka.core.converters.TextDirectoryLoader -dir text_example > text_example.arff
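Here text_example is the directory that holds the data, and text_example.arff is the ARFF file that gets generated.
One more point worth adding: in the resulting ARFF, the text content is stored as a single string attribute. In other words, the generated ARFF has exactly one feature plus one class label, and the class labels are the names of the classX sub-directories directly under text_example.
To make this easier to work with, Weka provides an attribute filter, StringToWordVector, that handles tokenization (here meaning a simple split of English text into words) and can also apply TF/IDF weighting. Note that it is an unsupervised filter: it lives in weka.filters.unsupervised.attribute, as the import in the code further down shows.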
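To make the "split, then weight by TF/IDF" idea concrete, here is a minimal plain-Java sketch. It is not Weka's implementation: the class and method names (TfIdfSketch, tokenize, tfIdf) are made up for illustration, and the weighting is the textbook tf * ln(N / df), which may differ from the exact formula StringToWordVector applies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative TF*IDF computation; hypothetical names, not Weka's code. */
public class TfIdfSketch {

    /** Lower-cases the text and splits it on runs of non-letter characters. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns one term-to-weight map per document, using tf * ln(N / df). */
    public static List<Map<String, Double>> tfIdf(List<String> docs) {
        List<Map<String, Integer>> counts = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // Raw term frequency within this document.
            Map<String, Integer> tf = new TreeMap<>();
            for (String term : tokenize(doc)) {
                tf.merge(term, 1, Integer::sum);
            }
            // Document frequency: each document counts once per term.
            for (String term : tf.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
            counts.add(tf);
        }
        List<Map<String, Double>> weights = new ArrayList<>();
        for (Map<String, Integer> tf : counts) {
            Map<String, Double> w = new TreeMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                w.put(e.getKey(), e.getValue() * idf);
            }
            weights.add(w);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat", "the dog sat down", "the bird flew");
        System.out.println(tfIdf(docs));
    }
}
```

A term that appears in every document (such as "the" in the sample input) gets weight 0 under this formula, which is the point of the IDF factor: words that do not discriminate between documents are suppressed.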
The following simple Weka program runs a classification end to end:

import weka.core.*;
import weka.core.converters.*;
import weka.classifiers.trees.*;
import weka.filters.*;
import weka.filters.unsupervised.attribute.*;

import java.io.*;

/**
 * Example class that converts text files stored in a directory structure into
 * an ARFF file using the TextDirectoryLoader converter. It then applies the
 * StringToWordVector to the data and feeds a J48 classifier with it.
 *
 * @author FracPete (fracpete at waikato dot ac dot nz)
 */
public class TextCategorizationTest {

  /**
   * Expects the first parameter to point to the directory with the text files
   * (falls back to ./text_example when no argument is given).
   * In that directory, each sub-directory represents a class and the text
   * files in these sub-directories will be labeled as such.
   *
   * @param args the commandline arguments
   * @throws Exception if something goes wrong
   */
  public static void main(String[] args) throws Exception {
    // convert the directory into a dataset
    TextDirectoryLoader loader = new TextDirectoryLoader();
    loader.setDirectory(new File(args.length > 0 ? args[0] : "./text_example"));
    Instances dataRaw = loader.getDataSet();
    System.out.println("\n\nNumber of classes in imported data:\n\n" + dataRaw.numClasses());

    // apply the StringToWordVector
    // (see the source code of setOptions(String[]) method of the filter
    // if you want to know which command-line option corresponds to which
    // bean property)
    StringToWordVector filter = new StringToWordVector();
    filter.setInputFormat(dataRaw);
    Instances dataFiltered = Filter.useFilter(dataRaw, filter);
    System.out.println("\n\nFiltered data:\n\n" + dataFiltered);

    // train J48 and output model
    J48 classifier = new J48();
    classifier.buildClassifier(dataFiltered);
    System.out.println("\n\nClassifier model:\n\n" + classifier);
  }
}


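Finally, I still recommend writing your own programs for data modeling and data generation. Data preparation can usually only be controlled precisely with your own code; at most, Weka helps us with selection and classification.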
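As one possible shape for such a hand-written preparation step, the sketch below walks the same class-per-sub-directory layout and emits ARFF text with one string attribute and one nominal label. Everything here is an assumption for illustration: the names DirToArffSketch, quote and toArff are hypothetical, the attribute names text and label are my own choice, the quoting is simplified (whitespace runs are flattened to a single space), and the output is not guaranteed to match what TextDirectoryLoader writes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Illustrative directory-to-ARFF converter; hypothetical names, simplified quoting. */
public class DirToArffSketch {

    /** Flattens whitespace, escapes backslashes and single quotes, wraps in single quotes. */
    public static String quote(String s) {
        String flat = s.replaceAll("\\s+", " ");
        String escaped = flat.replace("\\", "\\\\").replace("'", "\\'");
        return "'" + escaped + "'";
    }

    /** Lists the entries of a directory in sorted order. */
    private static List<Path> sortedEntries(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.sorted().collect(Collectors.toList());
        }
    }

    /** Builds ARFF text: each immediate sub-directory of root is one class label. */
    public static String toArff(Path root) throws IOException {
        List<Path> classDirs = sortedEntries(root).stream()
                .filter(Files::isDirectory)
                .collect(Collectors.toList());
        String labels = classDirs.stream()
                .map(d -> quote(d.getFileName().toString()))
                .collect(Collectors.joining(","));

        StringBuilder arff = new StringBuilder();
        arff.append("@relation text_directory\n\n");
        arff.append("@attribute text string\n");
        arff.append("@attribute label {").append(labels).append("}\n\n");
        arff.append("@data\n");
        for (Path dir : classDirs) {
            String label = quote(dir.getFileName().toString());
            for (Path file : sortedEntries(dir)) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip nested directories
                }
                // Assumes the files are UTF-8 encoded.
                String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                arff.append(quote(text.trim())).append(',').append(label).append('\n');
            }
        }
        return arff.toString();
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : "./text_example");
        System.out.print(toArff(root));
    }
}
```

Owning this step means you decide the encoding, the cleaning rules and which files are skipped, which is exactly the kind of control that is hard to get from a generic loader.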
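One more note: many friends have asked how to do text classification. If you would rather not read the papers, here is the short version. Whatever the classification task, classifiers are basically interchangeable (note the word "basically"). What matters is how the model is constructed and how the features are generated. As for the features used in text classification, there is TF*IDF, plus others such as mutual information, the chi-square statistic, expected cross entropy and so on. The formulas are all published, and computing them really is not hard. Of the classification problems I have worked on, text classification has some of the easiest features to compute.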
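To show how little code two of these scores need, here is a small plain-Java sketch of the chi-square statistic and (pointwise) mutual information for one term and one class. It assumes the usual 2x2 document counts: a = documents in the class that contain the term, b = documents outside the class that contain it, c = documents in the class without it, d = documents outside the class without it. The class and method names are hypothetical, and expected cross entropy is not covered here.

```java
/** Illustrative term/class scoring from a 2x2 contingency table; hypothetical names. */
public class FeatureScoreSketch {

    /**
     * Chi-square statistic: N * (a*d - c*b)^2 / ((a+c) * (b+d) * (a+b) * (c+d)),
     * where N = a + b + c + d. Returns 0 when any marginal total is zero.
     */
    public static double chiSquare(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double denom = (double) (a + c) * (b + d) * (a + b) * (c + d);
        if (denom == 0.0) {
            return 0.0;
        }
        double diff = (double) a * d - (double) c * b;
        return n * diff * diff / denom;
    }

    /**
     * Pointwise mutual information: ln( a * N / ((a+c) * (a+b)) ).
     * Returns negative infinity when the term never occurs in the class.
     */
    public static double pointwiseMutualInformation(long a, long b, long c, long d) {
        if (a == 0) {
            return Double.NEGATIVE_INFINITY;
        }
        double n = a + b + c + d;
        return Math.log(a * n / ((double) (a + c) * (a + b)));
    }

    public static void main(String[] args) {
        // Made-up counts for illustration only.
        System.out.println(chiSquare(30, 10, 20, 40));
        System.out.println(pointwiseMutualInformation(30, 10, 20, 40));
    }
}
```

When the term is independent of the class (for example a = b = c = d), both scores come out as 0, so ranking terms by either score and keeping the top ones is a simple form of feature selection.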

Posted on 2012-04-24 16:09 by changedi. Category: Machine Learning
