Optimizing Initial Cluster Centroids by SVD in K-means Algorithm for Chinese Text Clustering

Journal of System Simulation, 2018, Vol. 30, Issue (10): 3835-3842. doi: 10.16182/j.issn1004731x.joss.201810029

Dai Yueming^*, Wang Minghui, Zhang Ming, Wang Yan

Engineering Research Center of Internet of Things Technology Applications, Ministry of Education, Jiangnan University, Wuxi, Jiangsu 214122, China

Received: 2016-09-22 Revised: 2017-01-11 Online: 2018-10-10 Published: 2019-01-04
About the authors: Dai Yueming (1964-), male, from Changshu, Jiangsu; master's degree, associate professor, master's supervisor; research interests: artificial intelligence and software engineering. Wang Minghui (1992-), female, from Harbin, Heilongjiang; master's degree; research interests: data mining and artificial intelligence.
Funding:
National Natural Science Foundation of China (61572238), Jiangsu Province Outstanding Youth Fund (BK20160001)

Abstract

Abstract: Traditional K-means has several weaknesses in clustering: the number of clusters K is hard to preset accurately, the results depend on the initial centers, and the algorithm is sensitive to noise points and unstable. In text clustering, moreover, vectorized text data are high-dimensional, sparsely distributed, and carry latent semantic structure. To address these problems, an optimized Chinese text clustering algorithm (SVD-Kmeans) is proposed that first performs a rough classification based on the physical meaning of Singular Value Decomposition (SVD) and then applies K-means. The new algorithm uses the mathematical meaning of SVD to smooth the text data, keeping semantic features and removing noise, and uses the physical meaning of SVD to roughly classify the text data, taking the classification results as the initial cluster centers of K-means. Experimental results show that, compared with other K-means algorithms and their improved variants, SVD-Kmeans achieves a clearly higher clustering quality as measured by F-Measure.

Key words: SVD, text clustering, K-means, initial center point
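The pipeline summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the details of the rough classification, so the code below rests on assumptions: each document is assigned to the latent dimension on which it has the largest absolute weight, the rank-k reconstruction stands in for the "smoothing" step, K is passed in rather than estimated, and the function names `svd_rough_init` and `kmeans` are hypothetical. It is one plausible reading, not the authors' implementation.

```python
import numpy as np


def svd_rough_init(X, k):
    """Rank-k SVD of a documents-by-terms matrix X.

    Returns (X_smooth, rough_labels, centers).
    Assumption (not from the paper): a document belongs to the latent
    dimension on which it has the largest absolute weight.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    doc_topic = U[:, :k] * s[:k]           # document coordinates in latent space
    X_smooth = doc_topic @ Vt[:k, :]       # rank-k approximation ("smoothing")
    rough_labels = np.argmax(np.abs(doc_topic), axis=1)
    centers = np.vstack([
        X_smooth[rough_labels == j].mean(axis=0)
        if np.any(rough_labels == j)
        else X_smooth[j % len(X_smooth)]   # assumed fallback for an empty class
        for j in range(k)
    ])
    return X_smooth, rough_labels, centers


def assign(X, centers):
    """Index of the nearest center (squared Euclidean distance) per row."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def kmeans(X, centers, n_iter=100):
    """Plain Lloyd iterations started from the given initial centers."""
    centers = centers.copy()
    for _ in range(n_iter):
        labels = assign(X, centers)
        new_centers = np.vstack([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(X, centers), centers


# Toy term-count matrix: documents 0-2 and 3-4 use disjoint vocabularies.
X_toy = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [3, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
], dtype=float)

X_smooth, rough_labels, init_centers = svd_rough_init(X_toy, k=2)
final_labels, final_centers = kmeans(X_smooth, init_centers)
print(final_labels)
```

On this toy matrix the first three documents and the last two end up in separate clusters, because the two leading singular vectors each cover one vocabulary block. In practice the input would be a weighted matrix such as TF-IDF built from segmented Chinese text, which is outside the scope of this sketch.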

CLC number: TP317

Dai Yueming, Wang Minghui, Zhang Ming, Wang Yan. Optimizing Initial Cluster Centroids by SVD in K-means Algorithm for Chinese Text Clustering[J]. Journal of System Simulation, 2018, 30(10): 3835-3842.

