Journal of System Simulation ›› 2018, Vol. 30 ›› Issue (10): 3835-3842. doi: 10.16182/j.issn1004731x.joss.201810029

• Simulation Application Engineering •

Optimizing Initial Cluster Centroids by SVD in K-means Algorithm for Chinese Text Clustering

Dai Yueming*, Wang Minghui, Zhang Ming, Wang Yan

  1. Engineering Research Center of Internet of Things Technology Applications, Ministry of Education, Jiangnan University, Wuxi 214122, Jiangsu, China
  • Received: 2016-09-22 Revised: 2017-01-11 Online: 2018-10-10 Published: 2019-01-04
  • About the authors: Dai Yueming (b. 1964), male, from Changshu, Jiangsu; master's degree, associate professor, master's supervisor; research interests: artificial intelligence and software engineering. Wang Minghui (b. 1992), female, from Harbin, Heilongjiang; master's degree; research interests: data mining and artificial intelligence.
  • Funding:
    National Natural Science Foundation of China (61572238); Jiangsu Province Outstanding Youth Fund (BK20160001)

Abstract: The traditional K-means algorithm has several weaknesses: the number of clusters K is difficult to preset accurately, the clustering result depends on the initial centers, and the algorithm is sensitive to noise points and unstable. Text clustering adds further difficulties, because vectorized text data is high-dimensional, sparsely distributed, and carries latent semantic structure. To address both sets of problems, a Chinese text clustering algorithm (SVD-Kmeans) is proposed that first uses the physical significance of Singular Value Decomposition (SVD) to classify the data roughly and then applies K-means. The algorithm uses the mathematical significance of SVD to smooth the text data, retaining semantic features and removing noise, and uses the physical significance of SVD to produce a rough classification whose result serves as the initial cluster centers for K-means. Experimental results show that the clustering quality of SVD-Kmeans, measured by F-Measure, improves markedly over other K-means algorithms and their improved variants.

Key words: SVD, text clustering, K-means, initial center point
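
The pipeline described in the abstract (SVD smoothing, rough classification, then K-means seeded from the rough classes) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say how the rough classification is derived from the SVD, so the sketch assumes each document is assigned to the latent dimension on which its projection has the largest magnitude. The function name `svd_init_kmeans` and its parameters are hypothetical.

```python
import numpy as np


def svd_init_kmeans(X, k, n_iter=100):
    """Cluster the rows of X (documents x terms) with SVD-seeded K-means.

    Assumption (not stated in the abstract): a document's rough class is
    the latent dimension on which its projection has the largest magnitude.
    """
    X = np.asarray(X, dtype=float)
    if k > min(X.shape):
        raise ValueError("k must not exceed min(n_documents, n_terms)")

    # Truncated SVD: keep the k largest singular triplets, which drops
    # the low-energy directions and smooths the sparse term space.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :k] * s[:k]  # documents projected into the k-dim latent space

    # Rough classification: dominant latent dimension of each document.
    rough = np.argmax(np.abs(Z), axis=1)

    # Initial centers: mean of each rough class. An empty class falls
    # back to a single data point so that every center is defined.
    centers = np.zeros((k, Z.shape[1]))
    for j in range(k):
        members = Z[rough == j]
        centers[j] = members.mean(axis=0) if len(members) else Z[j]

    # Standard Lloyd iterations starting from the SVD-derived centers.
    labels = rough
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = Z[labels == j]
            if len(members):
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

In practice `X` would be a weighted term matrix (for example TF-IDF) built from word-segmented Chinese documents; that preprocessing step is assumed here and left out of the sketch. Because the centers come from the data rather than from random draws, repeated runs on the same input give the same clustering.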
