無為

          無為則可為,無為則至深!

            BlogJava :: 首頁 :: 聯系 :: 聚合  :: 管理
            190 Posts :: 291 Stories :: 258 Comments :: 0 Trackbacks

          Clustering is an important application area for many fields including data mining [FPSU96], statistical data analysis [KR89,BR93,FHS96], compression [ZRL97], vector quantization, and other business applications [B*96]. Clustering has been formulated in various ways in the machine learning [F87], pattern recognition [DH73,F90], optimization [BMS97,SI84], and statistics literature [KR89,BR93,B95,S92,S86]. The fundamental clustering problem is that of grouping together (clustering) similar data items.

          The most general approach is to view clustering as a density estimation problem [S86, S92, BR93]. We assume that in addition to the observed variables for each data item, there is a hidden, unobserved variable indicating the “cluster membership”. The data is assumed to arrive from a mixture model with hidden cluster identifiers. In general, a mixture model M having K clusters Ci, i=1,…,K, assigns a probability to a data point x: ?where Wi are the mixture weights. The problem is estimating the parameters of the individual Ci, assuming that the number of clusters K is known. The clustering optimization problem is that of finding parameters of the individual Ci which maximize the likelihood of the database given the mixture model. For general assumptions on the distributions for each of the K clusters, the EM algorithm [DLR77, CS96] is a popular technique for estimating the parameters. The assumptions addressed by the classic K-Means algorithm are: 1) each cluster can be effectively modeled by a spherical Gaussian distribution, 2) each data item is assigned to 1 cluster, 3) the mixture weights (Wi) are assumed equal. Note that KMeans [DH73, F90] is only defined over numeric (continuous-valued) data since the ability to compute the mean is required. A discrete version of K-Means exists and is sometimes referred to as harsh EM. The K-Mean algorithm finds a locally optimal solution to the problem of minimizing the sum of the L2 distance between each data point and its nearest cluster center (“distortion”) [SI84], which is equivalent to a maximizing the likelihood given the assumptions listed.

          There are various approaches to solving the optimization problem. The iterative refinement approaches, which include EM and K-Means, are the most effective. The basic algorithm is as follows: 1) Initialize the model parameters, producing a current model, 2) Decide memberships of the data items to clusters, assuming that the current model is correct, 3) Re-estimate the parameters of the current model assuming that the data memberships obtained in 2 are correct, producing new model, 4) If current model and new model are sufficiently close to each other, terminate, else go to 2).

          K-Means parameterizes cluster Ci by the mean of all points in that cluster, hence the model update step 3) consists of computing the mean of the points assigned to a given cluster. The membership step 2) consists of assigning data points to the cluster with the nearest mean measured in the L2 metric.

          We focus on the problem of clustering very large databases, those too large to be “loaded” in RAM. Hence the data scan at each iteration is extremely costly. We focus on the KMeans algorithm although the method can be extended to accommodate other algorithms [BFR98]. K-Means is a well-known algorithm, originally known as Forgy’s method [F65,M67] and has been used extensively in pattern recognition [DH73, F90]. It is a standard technique used in a wide array of applications, even as a way to initialize more expensive EM clustering [B95,CS96,MH98,FRB98, BF98].



          凡是有該標志的文章,都是該blog博主Caoer(草兒)原創,凡是索引、收藏
          、轉載請注明來處和原文作者。非常感謝。

          posted on 2006-06-24 13:54 草兒 閱讀(202) 評論(0)  編輯  收藏 所屬分類: BI and DM
          主站蜘蛛池模板: 峡江县| 周至县| 西宁市| 太谷县| 辉南县| 澄迈县| 阳泉市| 莆田市| 黄梅县| 永川市| 吴旗县| 旬邑县| 三穗县| 张家界市| 保德县| 四平市| 巴里| 湄潭县| 双峰县| 济南市| 宝清县| 乌苏市| 沅陵县| 娄烦县| 师宗县| 江山市| 灵武市| 河南省| 堆龙德庆县| 鹤壁市| 宜川县| 乌审旗| 永定县| 龙门县| 黄大仙区| 青岛市| 西峡县| 许昌市| 昌邑市| 灵璧县| 诸城市|