Tri-training Algorithm Based on Density Peaks Clustering

doi:10.16182/j.issn1004731x.joss.22-1550

Journal of System Simulation, 2024, Vol. 36, Issue (5): 1189-1198. doi: 10.16182/j.issn1004731x.joss.22-1550

Luo Yuhang¹, Wu Runxiu¹, Cui Zhihua², Zhang Yiying³, He Yeshen⁴, Zhao Jia¹

^1.School of Information Engineering, Nanchang Institute of Technology, Nanchang, Jiangxi 330099, China
^2.College of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan, Shanxi 030024, China
^3.College of Artificial Intelligence, Tianjin University of Science & Technology, Tianjin 300457, China
^4.China Gridcom Co., Ltd., Shenzhen, Guangdong 518000, China

Received: 2022-12-29 Revised: 2023-03-11 Online: 2024-05-15 Published: 2024-05-21
Corresponding author: Zhao Jia. E-mail: 1658051291@qq.com; zhaojia925@163.com
First author: Luo Yuhang (born 1997), male, master's student; research interest: data mining. E-mail: 1658051291@qq.com
Funding: National Natural Science Foundation of China (52069014)

Abstract

Tri-training improves the generalization ability of classifiers by exploiting unlabeled data, but it is prone to mislabeling unlabeled samples, which introduces training noise. To address this, a Tri-training algorithm based on density peaks clustering (Tri-training with density peaks clustering, DPC-TT) is proposed. Density peaks clustering can use cluster centers and local densities to select the samples that best represent the spatial structure of the data. DPC-TT applies density peaks clustering to obtain the cluster centers of the training data and the local density of each sample. Samples within the cutoff distance of a cluster center are regarded as representing the spatial structure well and are labeled as core data. Updating the classifiers with the core data reduces the training noise accumulated during iteration and thereby improves classifier performance. Experimental results show that DPC-TT achieves better classification performance than the standard Tri-training algorithm and its improved variants.

Key words: Tri-training, semi-supervised learning, density peaks clustering, spatial structure, classifier
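
The core-data selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the cutoff-kernel local density and the rho * delta center selection of the original density peaks clustering formulation by Rodriguez and Laio; the function name `dpc_core_mask` and the parameters `d_c` (cutoff distance) and `n_centers` are hypothetical. The Tri-training loop that would consume the core data is omitted.

```python
import numpy as np


def dpc_core_mask(X, d_c, n_centers):
    """Flag samples lying within d_c of a density-peak cluster center.

    Illustrative sketch only: cutoff-kernel density and rho * delta
    center selection are assumptions, not taken from the DPC-TT paper.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Local density: number of other samples closer than the cutoff distance.
    rho = (dist < d_c).sum(axis=1) - 1
    # delta: distance to the nearest sample of higher density.
    # Ties are broken by index; the densest sample gets its largest distance.
    order = np.argsort(-rho, kind="stable")
    delta = np.empty(n)
    delta[order[0]] = dist[order[0]].max()
    for pos in range(1, n):
        i = order[pos]
        delta[i] = dist[i, order[:pos]].min()
    # Cluster centers: samples with the largest rho * delta.
    centers = np.argsort(-(rho * delta), kind="stable")[:n_centers]
    # Core data: samples within the cutoff distance of any center.
    core = (dist[:, centers] <= d_c).any(axis=1)
    return core, centers
```

In a Tri-training round, only the newly pseudo-labeled samples with `core == True` would be passed on to retrain a classifier; how DPC-TT combines this filter with the classifiers' agreement is not specified in the abstract.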

CLC number: TP391.9

Luo Yuhang, Wu Runxiu, Cui Zhihua, et al. Tri-training Algorithm Based on Density Peaks Clustering[J]. Journal of System Simulation, 2024, 36(5): 1189-1198.


Table 2 Confusion matrix

Actual class	Predicted positive	Predicted negative
Positive	$N_{TP}$	$N_{FN}$
Negative	$N_{FP}$	$N_{TN}$
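
As a hedged illustration of how the usual binary classification metrics follow from the four confusion-matrix counts, the sketch below uses the standard definitions of accuracy, precision, recall and F1. The function name `binary_metrics` and its argument names are hypothetical, and returning 0.0 for an undefined ratio is a convention chosen here, not something stated in the source.

```python
def binary_metrics(n_tp, n_fn, n_fp, n_tn):
    """Standard metrics computed from the confusion-matrix counts."""
    total = n_tp + n_fn + n_fp + n_tn
    accuracy = (n_tp + n_tn) / total
    # Precision: share of predicted positives that are truly positive.
    precision = n_tp / (n_tp + n_fp) if (n_tp + n_fp) else 0.0
    # Recall: share of actual positives that were predicted positive.
    recall = n_tp / (n_tp + n_fn) if (n_tp + n_fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For example, `binary_metrics(40, 10, 20, 30)` gives an accuracy of 0.7 and a recall of 0.8.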



Dataset	Samples	Attributes	Positive class/%	Negative class/%
australian	690	14	44.5	55.5
wdbc	569	30	37.3	62.7
abalone	4177	8	32.1	67.9
bupa	345	6	42.0	58.0
electrical	10000	13	36.2	63.8
german	1000	24	30.0	70.0
haberman	306	3	26.5	73.5
heart	270	13	44.4	55.6
spectf	267	44	20.6	79.4

Dataset	Tri-training	TCE	ST	STCE	DPC-TT
australian	0.8022	0.8206	0.8266	0.8497	0.8657
wdbc	0.9379	0.9510	0.9441	0.9580	0.9721
abalone	0.8083	0.7799	0.7837	0.8038	0.8230
bupa	0.5971	0.5632	0.5517	0.5977	0.6456
electrical	0.9753	0.9944	0.9940	0.9960	0.9995
german	0.7116	0.7320	0.7400	0.7460	0.7307
haberman	0.7049	0.5513	0.6923	0.5385	0.7541
heart	0.7632	0.7206	0.7647	0.7794	0.8038
spectf	0.6415	0.6418	0.5672	0.6269	0.6792

Dataset	Tri-training	TCE	ST	STCE	DPC-TT
australian	0.7422	0.7765	0.8077	0.7826	0.8526
wdbc	0.9359	0.9857	0.9718	1.0000	0.9614
abalone	0.6120	0.5455	0.5551	0.6143	0.7522
bupa	0.6126	0.8000	0.6818	0.6875	0.6209
electrical	0.9746	0.9945	0.9923	0.9935	0.9990
german	0.6750	0.5067	0.5185	0.5652	0.7471
haberman	0.3693	0.1724	0.2778	0.1875	0.4542
heart	0.7165	0.6471	0.7097	0.7059	0.7828
spectf	0.6147	0.3429	0.2778	0.3421	0.5862

Dataset	Tri-training	TCE	ST	STCE	DPC-TT
australian	0.7490	0.8098	0.8077	0.8471	0.7775
wdbc	0.9359	0.9517	0.9452	0.9583	0.9614
abalone	0.5644	0.5756	0.5720	0.5720	0.7683
bupa	0.5126	0.3871	0.4348	0.5570	0.6208
electrical	0.9746	0.9923	0.9918	0.9945	0.9991
german	0.5450	0.5315	0.5638	0.5693	0.7470
haberman	0.2843	0.2222	0.2941	0.2500	0.3788
heart	0.7410	0.6984	0.7333	0.7619	0.8421
spectf	0.5458	0.5000	0.4082	0.5098	0.6938

Dataset	Tri-training	TCE	ST	STCE	DPC-TT
australian	0.6685	0.8462	0.8077	0.9231	0.6950
wdbc	0.8866	0.9200	0.9200	0.9200	0.9612
abalone	0.6408	0.6094	0.5898	0.5352	0.6368
bupa	0.3276	0.2553	0.3191	0.4681	0.3165
electrical	0.9577	0.9902	0.9913	0.9956	0.9995
german	0.8833	0.5588	0.6176	0.5735	0.9363
haberman	0.2333	0.3125	0.3125	0.3750	0.3294
heart	0.9112	0.7586	0.7586	0.8276	0.9114
spectf	0.6928	0.9231	0.7692	1.0000	0.8500


Dataset	FMI
australian	0.5445
wdbc	0.7455
abalone	0.5606
bupa	0.5112
electrical	0.5330
german	0.7613
haberman	0.5677
heart	0.7101
spectf	0.4542
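
For readers unfamiliar with the Fowlkes-Mallows index (FMI), the following sketch computes it from pair counts using the standard definition: the number of sample pairs grouped together in both labelings, divided by the geometric mean of the pairs grouped together in each labeling. It presumes that the table compares cluster assignments with the true class labels, which is an assumption about the source. The function name `fowlkes_mallows` is hypothetical.

```python
from collections import Counter
from math import comb, sqrt


def fowlkes_mallows(labels_true, labels_pred):
    """Fowlkes-Mallows index between two labelings of the same samples."""
    # Pairs placed together in both labelings (pairwise true positives).
    both = sum(comb(c, 2)
               for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together in the reference labeling (TP + FN).
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    # Pairs placed together in the predicted labeling (TP + FP).
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    if same_true == 0 or same_pred == 0:
        return 0.0  # convention chosen here for the degenerate case
    return both / sqrt(same_true * same_pred)
```

The index is 1.0 when the two labelings induce the same partition, regardless of how the groups are named, and decreases toward 0 as they diverge.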

Metric	Tri-training	TCE	ST	STCE	DPC-TT
Mean	2.36	2.36	2.46	3.39	4.56
Accuracy	2.22	2.33	2.33	3.44	4.67
Recall	2.67	2.44	2.56	3.11	4.22
F1 score	2.11	2.22	2.39	3.61	4.67
Precision	2.44	2.44	2.56	3.39	4.67
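
The values in this table behave like average ranks over the nine datasets, with a higher rank meaning a better score. The sketch below shows one way to compute such ranks. It assumes ascending ranking with tied scores sharing the mean rank, and it assumes that the first of the four per-dataset results tables reports accuracy; under those assumptions it reproduces the Accuracy row. The function name `average_ranks` is hypothetical.

```python
def average_ranks(scores):
    """Average rank of each algorithm over datasets (higher score, higher rank).

    scores: one row per dataset, one score per algorithm.
    """
    n_alg = len(scores[0])
    totals = [0.0] * n_alg
    for row in scores:
        order = sorted(range(n_alg), key=lambda j: row[j])
        pos = 0
        while pos < n_alg:
            # Extend the run of tied scores starting at position pos.
            end = pos
            while end + 1 < n_alg and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # tied entries share the mean rank
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / len(scores) for t in totals]


# Columns: Tri-training, TCE, ST, STCE, DPC-TT (first results table).
accuracy = [
    [0.8022, 0.8206, 0.8266, 0.8497, 0.8657],  # australian
    [0.9379, 0.9510, 0.9441, 0.9580, 0.9721],  # wdbc
    [0.8083, 0.7799, 0.7837, 0.8038, 0.8230],  # abalone
    [0.5971, 0.5632, 0.5517, 0.5977, 0.6456],  # bupa
    [0.9753, 0.9944, 0.9940, 0.9960, 0.9995],  # electrical
    [0.7116, 0.7320, 0.7400, 0.7460, 0.7307],  # german
    [0.7049, 0.5513, 0.6923, 0.5385, 0.7541],  # haberman
    [0.7632, 0.7206, 0.7647, 0.7794, 0.8038],  # heart
    [0.6415, 0.6418, 0.5672, 0.6269, 0.6792],  # spectf
]
print([round(r, 2) for r in average_ranks(accuracy)])
# [2.22, 2.33, 2.33, 3.44, 4.67]
```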
