
Weka customization project added to GitHub

changedi — Tue, 28 May 2013 03:46:00 GMT

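Today I added the official Weka 3.7.0 development version to GitHub; anyone who needs it is welcome to download and use it. I have already configured libsvm and liblinear in it, and I have customized clusterEvaluation to output some extra information, such as a comparison between the original class label and the cluster label for wrongly clustered instances. This helps locate which instances were assigned to the wrong class in the results of clustering algorithms such as EM or KMEANS.

Friends who are interested in Weka are also welcome to contribute code and raise feature requests, which I can help implement. From time to time I will add further Weka customizations, and I will add Chinese comments at the source level to help users.

My Weka GitHub address: https://github.com/changedi/weka (read-only git path: git://github.com/changedi/weka.git). Forks are welcome.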
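The comparison of class labels and cluster labels described above can be sketched in plain Java. This is a minimal illustration and not the code in the repository: the class `MisclusteredReport` and its methods are hypothetical, and it assumes each cluster is simply mapped to its most frequent class label, which is a simplification of how a classes-to-clusters evaluation may assign labels.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, for illustration only. */
final class MisclusteredReport {

    /** Maps each cluster id to the class label that occurs most often in it. */
    static Map<Integer, String> majorityLabels(int[] clusters, String[] labels) {
        Map<Integer, Map<String, Integer>> counts = new HashMap<>();
        for (int i = 0; i < clusters.length; i++) {
            counts.computeIfAbsent(clusters[i], k -> new HashMap<>())
                  .merge(labels[i], 1, Integer::sum);
        }
        Map<Integer, String> majority = new HashMap<>();
        for (Map.Entry<Integer, Map<String, Integer>> e : counts.entrySet()) {
            String best = null;
            int bestCount = -1;
            for (Map.Entry<String, Integer> c : e.getValue().entrySet()) {
                if (c.getValue() > bestCount) {
                    best = c.getKey();
                    bestCount = c.getValue();
                }
            }
            majority.put(e.getKey(), best);
        }
        return majority;
    }

    /** Returns the indices of instances whose class label differs from their cluster's majority label. */
    static List<Integer> misclustered(int[] clusters, String[] labels) {
        Map<Integer, String> majority = majorityLabels(clusters, labels);
        List<Integer> wrong = new ArrayList<>();
        for (int i = 0; i < clusters.length; i++) {
            if (!majority.get(clusters[i]).equals(labels[i])) {
                wrong.add(i);
            }
        }
        return wrong;
    }

    public static void main(String[] args) {
        int[] clusters = {0, 0, 0, 1, 1, 1};           // cluster assigned to each instance
        String[] labels = {"a", "a", "b", "b", "b", "a"}; // original class labels
        Map<Integer, String> majority = majorityLabels(clusters, labels);
        for (int i : misclustered(clusters, labels)) {
            System.out.println("instance " + i + ": class=" + labels[i]
                + ", cluster label=" + majority.get(clusters[i]));
        }
    }
}
```

In a real run the `clusters` array would come from the clusterer's assignments and `labels` from the class attribute of the dataset.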

changedi 2013-05-28 11:46

Some tips on feature preprocessing in Weka

changedi — Tue, 24 Apr 2012 08:09:00 GMT

First, here are two links that contain the full original material:
http://weka.wikispaces.com/Text+categorization+with+Weka
http://weka.wikispaces.com/ARFF+files+from+Text+Collections

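Weka can read data from a directory structure.

Next, a few things to watch for when processing text features in Weka. One caveat first: this feature, reading data from a directory structure, is not available in the Weka GUI.

First, put the data to be processed into a directory structure like this: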
 ...
 |
 +- text_example
    |
    +- class1
    |  |
    |  + file1.txt
    |  |
    |  + file2.txt
    |  |
    |  ...
    |
    +- class2
    |  |
    |  + another_file1.txt
    |  |
    |  + another_file2.txt
    |  |
    |  ...

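Then, from the source package directory, run the following on the command line:

java weka.core.converters.TextDirectoryLoader -dir text_example > text_example.arff

Here text_example is the directory that holds the data, and the .arff file after the redirect is the generated ARFF file. One more point worth adding: in the ARFF file produced this way, the text content is stored as a single string attribute. In other words, the generated ARFF has one feature plus one class label, and the class label is the name of the classX subdirectory under text_example. To make this easier to work with, Weka provides an attribute filter, StringToWordVector (an unsupervised filter, in weka.filters.unsupervised.attribute), that tokenizes the text (here meaning a split of English text) and can also apply TF/IDF weighting.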
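To make the TF/IDF idea concrete, here is a minimal plain-Java sketch of the textbook weighting tf * log(N / df). It is an illustration only: the class `TfIdfSketch` and its methods are hypothetical, the tokenizer is a crude split on non-letters, and the exact transforms applied by StringToWordVector may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, for illustration only. */
final class TfIdfSketch {

    /** Counts lower-cased words; splitting on non-letters stands in for a real tokenizer. */
    static Map<String, Integer> termCounts(String doc) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : doc.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns, for each document, a map from word to tf * log(N / df). */
    static List<Map<String, Double>> tfIdf(List<String> docs) {
        List<Map<String, Integer>> perDoc = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>(); // number of documents containing each word
        for (String doc : docs) {
            Map<String, Integer> counts = termCounts(doc);
            perDoc.add(counts);
            for (String word : counts.keySet()) {
                docFreq.merge(word, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> weights = new ArrayList<>();
        for (Map<String, Integer> counts : perDoc) {
            Map<String, Double> w = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                w.put(e.getKey(), e.getValue() * idf);
            }
            weights.add(w);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat", "the dog sat down", "the cat ran");
        System.out.println(tfIdf(docs));
    }
}
```

A word that appears in every document (here "the") gets weight 0, while a word unique to one document (here "dog") gets the largest weight, which is the behaviour that makes TF/IDF useful as a text feature.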
The simple code below performs a classification:
import weka.core.*;
import weka.core.converters.*;
import weka.classifiers.trees.*;
import weka.filters.*;
import weka.filters.unsupervised.attribute.*;

import java.io.*;

/**
 * Example class that converts HTML files stored in a directory structure into
 * an ARFF file using the TextDirectoryLoader converter. It then applies the
 * StringToWordVector to the data and feeds a J48 classifier with it.
 *
 * @author FracPete (fracpete at waikato dot ac dot nz)
 */
public class TextCategorizationTest {

  /**
   * Reads the text files from the "./text_example" directory.
   * In that directory, each sub-directory represents a class and the text
   * files in these sub-directories will be labeled as such.
   *
   * @param args        the commandline arguments (not used)
   * @throws Exception  if something goes wrong
   */
  public static void main(String[] args) throws Exception {
    // convert the directory into a dataset
    TextDirectoryLoader loader = new TextDirectoryLoader();
    loader.setDirectory(new File("./text_example"));
    Instances dataRaw = loader.getDataSet();
    System.out.println("\n\nNumber of classes in imported data:\n\n" + dataRaw.numClasses());

    // apply the StringToWordVector
    // (see the source code of setOptions(String[]) method of the filter
    // if you want to know which command-line option corresponds to which
    // bean property)
    StringToWordVector filter = new StringToWordVector();
    filter.setInputFormat(dataRaw);
    Instances dataFiltered = Filter.useFilter(dataRaw, filter);
    System.out.println("\n\nFiltered data:\n\n" + dataFiltered);

    // train J48 and output model
    J48 classifier = new J48();
    classifier.buildClassifier(dataFiltered);
    System.out.println("\n\nClassifier model:\n\n" + classifier);
  }
}

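Finally, I still recommend writing your own programs for data modeling and data generation. Data preparation can usually be controlled precisely only in your own code; at most, Weka helps with selection and classification.

One more point: many friends have asked how to do text classification. If you would rather not read the papers, here is the basic idea. Whatever the classification task, classifiers are basically interchangeable (note: basically). What matters is how the model is built and how the features are generated. As for the features used in text classification, there is TF*IDF and others such as mutual information, the chi-square statistic, and expected cross entropy. The formulas are all available, and computing them is really not hard. Of the classification problems I have worked on, feature computation for text classification is among the easier ones.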
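As an illustration of how little code these feature scores need, here is a minimal sketch of the chi-square statistic and the pointwise mutual information for one term and one class, computed from a 2x2 document count table. The class `TermClassStats` and the parameter names are hypothetical; the formulas are the standard textbook forms, which I am assuming match what the post has in mind.

```java
/** Hypothetical helper, for illustration only. */
final class TermClassStats {

    /**
     * Chi-square statistic for a term and a class.
     *
     * @param a documents in the class that contain the term
     * @param b documents outside the class that contain the term
     * @param c documents in the class that do not contain the term
     * @param d documents outside the class that do not contain the term
     */
    static double chiSquare(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double diff = (double) a * d - (double) c * b;
        double denom = (double) (a + c) * (b + d) * (a + b) * (c + d);
        // An empty row or column makes the statistic undefined; report 0 in that case.
        return denom == 0.0 ? 0.0 : n * diff * diff / denom;
    }

    /**
     * Pointwise mutual information log( P(t, c) / (P(t) P(c)) ), same counts as above.
     * Evaluates to negative infinity when a is 0.
     */
    static double mutualInformation(long a, long b, long c, long d) {
        double n = a + b + c + d;
        return Math.log(a * n / ((double) (a + c) * (a + b)));
    }

    public static void main(String[] args) {
        // A term that occurs mostly inside the class scores high on both measures.
        System.out.println(chiSquare(40, 10, 10, 40));
        System.out.println(mutualInformation(40, 10, 10, 40));
        // A term distributed independently of the class scores 0 on both.
        System.out.println(chiSquare(25, 25, 25, 25));
        System.out.println(mutualInformation(25, 25, 25, 25));
    }
}
```

Terms can then be ranked by either score per class, and the top-ranked terms kept as features.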

changedi 2012-04-24 16:09

Using Weka from Java (3): feature selection

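changedi — Tue, 23 Nov 2010 02:06:00 GMT

Continuing the Weka programming series. An important step in data mining is feature selection, whose main purpose is to reduce dimensionality, lower the computational complexity, and discard features that may be noise. In my paper and in my thesis I used CFS feature subset selection combined with best-first or greedy search, which reduces and simplifies a fairly high-dimensional training feature set. Roughly, CFS + Best first reduced the ?45-dimensional features of my training samples to somewhere between ?0 and 50.

The concrete implementation is shown in the test code below (for demonstration only):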
package edu.tju.ikse.mi.util;

import java.io.File;
import java.io.IOException;
import java.util.Random;

import weka.attributeSelection.ASEvaluation;
import weka.attributeSelection.ASSearch;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

/**
 * @author Jia Yu
 * @date 2010-11-23
 */
public class WekaSelector {

    private ArffLoader loader;
    private Instances dataSet;
    private File arffFile;
    private int sizeOfDataset;
    private int numOfOldAttributes;
    private int numOfNewAttributes;
    private int classIndex;
    private int[] selectedAttributes;

    public WekaSelector(File file) throws IOException {
        loader = new ArffLoader();
        arffFile = file;
        loader.setFile(arffFile);
        dataSet = loader.getDataSet();
        sizeOfDataset = dataSet.numInstances();
        numOfOldAttributes = dataSet.numAttributes();
        // the last attribute is treated as the class
        classIndex = numOfOldAttributes - 1;
        dataSet.setClassIndex(classIndex);
    }

    public void select() throws Exception {
        // CFS subset evaluator combined with best-first search
        ASEvaluation evaluator = new CfsSubsetEval();
        ASSearch search = new BestFirst();
        AttributeSelection eval = new AttributeSelection();
        eval.setEvaluator(evaluator);
        eval.setSearch(search);

        // run the selection on the full dataset
        eval.SelectAttributes(dataSet);
        numOfNewAttributes = eval.numberAttributesSelected();
        selectedAttributes = eval.selectedAttributes();
        System.out.println("result is "+eval.toResultsString());
        /*
        Cross-validation variant; seed and numFolds must be defined before
        enabling this block.

        Random random = new Random(seed);
        dataSet.randomize(random);
        if (dataSet.attribute(classIndex).isNominal()) {
            dataSet.stratify(numFolds);
        }
        for (int fold = 0; fold < numFolds; fold++) {
            Instances train = dataSet.trainCV(numFolds, fold, random);
            eval.selectAttributesCVSplit(train);
        }
        System.out.println("result is "+eval.CVResultsString());
        */
        System.out.println("old number of Attributes is "+numOfOldAttributes);
        System.out.println("new number of Attributes is "+numOfNewAttributes);
        for(int i=0;i<selectedAttributes.length;i++){
            System.out.println(selectedAttributes[i]);
        }
    }

    /**
     * @param args not used; the input file is fixed to iris.arff
     */
    public static void main(String[] args) {
        File file = new File("iris.arff");
        try {
            WekaSelector ws = new WekaSelector(file);
            ws.select();
        } catch (IOException e) {
            // the ARFF file could not be read
            e.printStackTrace();
        } catch (Exception e) {
            // attribute selection failed
            e.printStackTrace();
        }
    }

}

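The commented-out part is the cross-validation variant. The default is 10-fold cross-validation, which can of course be changed through the setter methods. For concrete usage, or for the methods that reduce dimensionality, see the source code; Weka being open source makes this very convenient. The main class to look at is weka.attributeSelection.AttributeSelection. To see how it is invoked and how the options are chosen, have a look at the weka.gui.explorer.AttributeSelectionPanel class.

Running the WekaSelector code above produces the following output: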

result is

=== Attribute Selection on all input data ===

Search Method:
Best first.
Start set: no attributes
Search direction: forward
Stale search after 5 node expansions
Total number of subsets evaluated: 12
Merit of best subset found:    0.887

Attribute Subset Evaluator (supervised, Class (nominal): 5 class):
CFS Subset Evaluator
Including locally predictive attributes

Selected attributes: 3,4 : 2
                     petallength
                     petalwidth

old number of Attributes is 5
new number of Attributes is 2
2
3
4

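The original iris dataset has 4 attributes (5 dimensions in total once the class label is included). After feature selection only the 3rd and 4th features are retained, so the new feature subset has two dimensions (not counting the class label; sorry if this is a bit convoluted, I always do this).

The final 2, 3, 4 are zero-based indices into the attribute array: the retained attributes are the 3rd, 4th and 5th, the last of which is the class attribute.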
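The "merit" reported in the output is the score CFS assigns to a feature subset. As a rough sketch, the standard CFS merit of a subset of k features is k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. The class `CfsMerit` below is hypothetical, and how the correlations themselves are estimated is deliberately left out.

```java
/** Hypothetical helper, for illustration only. */
final class CfsMerit {

    /**
     * CFS merit of a feature subset.
     *
     * @param k   number of features in the subset
     * @param rcf mean correlation between each feature and the class
     * @param rff mean pairwise correlation between the features
     */
    static double merit(int k, double rcf, double rff) {
        return k * rcf / Math.sqrt(k + k * (k - 1) * rff);
    }

    public static void main(String[] args) {
        // One feature: the merit equals its correlation with the class.
        System.out.println(merit(1, 0.8, 0.0));
        // Adding a second, partly redundant feature raises the merit.
        System.out.println(merit(2, 0.8, 0.5));
        // Adding a fully redundant feature (rff = 1) gains nothing.
        System.out.println(merit(2, 0.8, 1.0));
    }
}
```

The formula rewards subsets whose features correlate with the class and penalizes subsets whose features correlate with each other, which is why a small subset such as the two petal attributes can win on iris.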

changedi 2010-11-23 10:06

changedi — Thu, 04 Nov 2010 01:51:00 GMT
Summary: Following on from the previous post: having covered clustering, here is the classification code I used. /** */ package edu.tju.ikse.mi.util; import j...

changedi 2010-11-04 09:51

changedi — Thu, 04 Nov 2010 01:24:00 GMT
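Summary: Weka is a well-known data mining tool; there is a detailed introduction here, and IDMer's blog also has fairly detailed usage notes. Using the Weka tool directly is of course no problem, but what if you want to use Weka's functionality inside your own platform or framework? Here I share what I worked out while studying the Weka source code, mainly how to call the Weka API. It is for reference only; if you find problems in the code, feel free to email me. A brief walkthrough of the flow: the constructor first loads an ARFF file, then calls doCluster()...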

changedi 2010-11-04 09:24

Bayesian decision theory: notes

changedi — Wed, 15 Sep 2010 03:23:00 GMT
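The basic idea of Bayesian decision theory is very simple. To minimize the overall risk, always choose the action that minimizes the conditional risk R(a|x). In particular, to minimize the probability of error in a classification problem, always choose the class that maximizes the posterior probability P(wj|x). Bayes' formula lets us compute the posterior probability from the prior probability P(wj) and the conditional density p(x|wj). If misclassifications of different classes carry different penalties, the posterior probabilities must first be weighted by the penalty function before a decision is made.

If the underlying distributions are multivariate Gaussian, the decision boundaries are hyperquadrics whose shape and position depend on the prior probabilities and on the means and covariances of the distributions. An upper bound on the true expected error rate is given by the Chernoff bound and by the computationally simpler Bhattacharyya bound. If an input test pattern has missing or corrupted features, we must form the marginal distribution by integrating over those features and then apply the Bayes decision procedure to the resulting distribution.

In practice, what we usually have is feature data made up of various attributes, and defining the risk function, the prior probabilities, and the conditional probabilities from that data is an important prerequisite. Given only a finite amount of data, obtaining these probabilities becomes a matter of statistics. The next question is therefore how to estimate them, and the commonly used methods are maximum likelihood estimation and Bayesian parameter estimation.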
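The decision rule described in these notes can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions: the class `BayesDecisionSketch`, its method names, and the numbers in `main` are all made up for the example, and the likelihood values p(x|wj) are taken as given rather than estimated from data.

```java
/** Hypothetical helper, for illustration only. */
final class BayesDecisionSketch {

    /** Bayes' formula: P(wj|x) = p(x|wj) P(wj) / sum_k p(x|wk) P(wk). */
    static double[] posteriors(double[] priors, double[] likelihoods) {
        double[] post = new double[priors.length];
        double evidence = 0.0;
        for (int j = 0; j < priors.length; j++) {
            post[j] = likelihoods[j] * priors[j];
            evidence += post[j];
        }
        for (int j = 0; j < post.length; j++) {
            post[j] /= evidence;
        }
        return post;
    }

    /**
     * Returns the action i that minimizes the conditional risk
     * R(a_i|x) = sum_j loss[i][j] * P(wj|x), where loss[i][j] is the
     * penalty for taking action i when the true class is wj.
     */
    static int minRiskAction(double[][] loss, double[] post) {
        int best = 0;
        double bestRisk = Double.POSITIVE_INFINITY;
        for (int i = 0; i < loss.length; i++) {
            double risk = 0.0;
            for (int j = 0; j < post.length; j++) {
                risk += loss[i][j] * post[j];
            }
            if (risk < bestRisk) {
                bestRisk = risk;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] priors = {0.9, 0.1};       // P(w0), P(w1): made-up values
        double[] likelihoods = {0.2, 0.6};  // p(x|w0), p(x|w1): made-up values
        double[] post = posteriors(priors, likelihoods);

        double[][] zeroOne = {{0, 1}, {1, 0}};     // every error costs the same
        double[][] asymmetric = {{0, 10}, {1, 0}}; // missing class w1 costs ten times more

        System.out.println("zero-one loss picks action " + minRiskAction(zeroOne, post));
        System.out.println("asymmetric loss picks action " + minRiskAction(asymmetric, post));
    }
}
```

With the zero-one loss the rule reduces to picking the class with the largest posterior; with the asymmetric loss the same posteriors lead to the other action, which is the weighting by the penalty function that the notes mention.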

changedi 2010-09-15 11:23
