Data Generation Model-based Synthetic Sample Imputation Method

doi:10.16182/j.issn1004731x.joss.22-0554

Abstract:

To address the inconsistency between the probability distribution of imputed synthetic samples and that of real samples, a data generation model-based synthetic sample imputation (DGM-SSI) method is proposed. A data generation model of the real samples is built on a Gaussian mixture model, whose number of components is determined by a multi-model fusion strategy. The required synthetic samples are then imputed with the data generation model fitted on the real samples, where the components of the model and their weights control how the synthetic samples are generated. The feasibility and effectiveness of DGM-SSI are verified on 20 multimodal, multi-dimensional mixture distributions. The experimental results show that, compared with four methods, namely random sample imputation (RSI), the synthetic minority over-sampling technique (SMOTE), and its two latest variants, the proposed method yields synthetic samples whose probability distribution is more consistent with that of the real samples, which supports DGM-SSI as a reasonable synthetic sample imputation method.

Key words: synthetic sample imputation, data generation model, Gaussian mixture model, synthetic minority over-sampling technique, probability distribution consistency
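The generation step described in the abstract, where mixture components and their weights control how synthetic samples are drawn, can be illustrated with a minimal sketch. It assumes a one-dimensional mixture whose parameters are already known; the values of `weights`, `means`, and `stds` are hypothetical, and neither the model fitting nor the multi-model fusion strategy for choosing the number of components is reproduced here.

```python
import random


def sample_gmm_1d(weights, means, stds, n, seed=0):
    """Draw n points from a 1-D Gaussian mixture.

    Each point first picks a component with probability equal to that
    component's weight, then draws from the component's Gaussian.
    """
    rng = random.Random(seed)
    components = rng.choices(range(len(weights)), weights=weights, k=n)
    return [rng.gauss(means[c], stds[c]) for c in components]


# Hypothetical parameters standing in for a mixture fitted on real samples.
weights = [0.3, 0.7]
means = [-2.0, 3.0]
stds = [0.5, 1.0]

synthetic = sample_gmm_1d(weights, means, stds, n=2000)

# Fraction of points that fall on the side of the first component;
# it should be close to that component's weight.
share_first = sum(x < 0.5 for x in synthetic) / len(synthetic)
print(len(synthetic), round(share_first, 2))
```

Because the component is chosen in proportion to its weight, the share of synthetic points near each mode tracks the mixture weights, which is the property the abstract relies on for distribution consistency.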

CLC number:

TP391.9

He Yulin, Chen Jiaqi, Xu Hepeng, et al. Data Generation Model-based Synthetic Sample Imputation Method[J]. Journal of System Simulation, 2023, 35(9): 1948-1964.

Figures/Tables (14)

Figs. 1-8

Table 1

MMD comparison of five synthetic sample imputation methods on 10 Gaussian-based distributions

Dataset	No. of real samples	No. of synthetic samples	DGM-SSI	RSI	SMOTE	SMOTE-D	k-means SMOTE
$f_1(x)$	2,000	2,000	0.0158±0.0030	0.2256±0.0147	0.8302±0.0143	0.7046±0.0470	0.8328±0.0143
$f_2(x)$	4,000	4,000	0.0124±0.0010	0.3541±0.0015	0.8926±0.0027	0.8236±0.0144	0.8923±0.0022
$f_3(x)$	4,000	4,000	0.0397±0.0375	0.2677±0.0024	0.0560±0.0101	0.0876±0.0072	0.0761±0.0077
$f_4(x)$	8,000	8,000	0.0177±0.0049	0.1362±0.0046	0.4860±0.0033	0.3946±0.0091	0.6428±0.0124
$f_5(x)$	6,000	6,000	0.0188±0.0012	0.1930±0.0028	0.3109±0.0043	0.2573±0.0068	0.3536±0.0248
$f_6(x)$	12,000	12,000	0.0126±0.0016	0.1496±0.0001	0.4104±0.0049	0.3417±0.0063	0.4122±0.0025
$f_7(x)$	8,000	8,000	0.0149±0.0006	0.1326±0.0014	0.2670±0.0071	0.2244±0.0046	0.2672±0.0056
$f_8(x)$	16,000	16,000	0.0154±0.0036	0.0818±0.0001	0.2468±0.0020	0.2089±0.0059	0.2467±0.0018
$f_9(x)$	10,000	10,000	0.0128±0.0001	0.1202±0.0010	0.2679±0.0049	0.2250±0.0012	0.4199±0.2552
$f_{10}(x)$	20,000	20,000	0.0097±0.0001	0.0606±0.0002	0.1808±0.0008	0.1487±0.0026	0.1809±0.0009
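For readers unfamiliar with the metric in the table above, the following is a minimal sketch of a maximum mean discrepancy (MMD) estimate between two one-dimensional samples. It uses the standard biased estimator of the squared MMD with a Gaussian kernel; the kernel choice, the `bandwidth` value, and the toy samples are assumptions for illustration, and the source does not state which estimator or kernel settings produced the reported values.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two 1-D samples.

    MMD^2 = mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)
    """
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy data: one sample from the same distribution, one from a shifted one.
rng = random.Random(0)
real = [rng.gauss(0.0, 1.0) for _ in range(200)]
close = [rng.gauss(0.0, 1.0) for _ in range(200)]
shifted = [rng.gauss(3.0, 1.0) for _ in range(200)]

mmd_close = mmd_squared(real, close)
mmd_shifted = mmd_squared(real, shifted)
print(mmd_close < mmd_shifted)
```

A smaller value means the two samples are harder to tell apart under the chosen kernel, which matches how the table is read: lower MMD indicates synthetic samples closer in distribution to the real ones.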

Table 2

KL divergence comparison of five synthetic sample imputation methods on 10 Gaussian-based distributions

Dataset	No. of real samples	No. of synthetic samples	DGM-SSI	RSI	SMOTE	SMOTE-D	k-means SMOTE
$f_1(x)$	2,000	2,000	0.0012±0.0008	0.0119±0.0069	1.3013±0.2833	0.2245±0.0406	0.0119±0.0069
$f_2(x)$	4,000	4,000	0.0009±0.0013	0.0113±0.0131	4.7911±2.1096	9.9034±3.2139	0.0113±0.0131
$f_3(x)$	4,000	4,000	0.0428±0.0717	0.3921±0.0462	0.1304±0.0554	0.3764±0.0761	0.3921±0.0462
$f_4(x)$	8,000	8,000	0.0093±0.0081	0.0876±0.0370	0.5717±0.0184	0.1996±0.0648	0.0876±0.0370
$f_5(x)$	6,000	6,000	0.0053±0.0011	0.0511±0.0054	0.1122±0.0206	0.0874±0.0149	0.0511±0.0054
$f_6(x)$	12,000	12,000	0.0030±0.0026	0.0152±0.0044	0.1202±0.0050	0.9684±0.0070	0.0152±0.0044
$f_7(x)$	8,000	8,000	0.0018±0.0004	0.0268±0.0103	0.1547±0.0120	0.2362±0.0068	0.0268±0.0103
$f_8(x)$	16,000	16,000	0.0355±0.0305	0.0232±0.0008	0.1504±0.0192	0.3021±0.0119	0.0232±0.0008
$f_9(x)$	10,000	10,000	0.0017±0.0011	0.1759±0.0055	0.1272±0.0236	0.1892±0.0243	0.1759±0.0055
$f_{10}(x)$	20,000	20,000	0.0013±0.0009	0.0334±0.0059	0.1432±0.0374	2.5361±0.0872	0.0334±0.0059
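As an illustration of the second metric, the sketch below estimates the Kullback-Leibler (KL) divergence between two one-dimensional samples with a simple histogram plug-in estimator. This is an assumed stand-in chosen for brevity: the bin count, the smoothing constant `eps`, and the toy data are hypothetical, and the source does not say which KL estimator was used for the reported values.

```python
import math
import random


def kl_divergence_hist(p_sample, q_sample, bins=20, eps=1e-9):
    """Plug-in estimate of KL(P || Q) from two 1-D samples.

    Both samples are binned on a shared grid, and the discrete KL
    divergence of the resulting bin probabilities is returned.
    """
    lo = min(min(p_sample), min(q_sample))
    hi = max(max(p_sample), max(q_sample))
    span = hi - lo
    width = span / bins if span > 0 else 1.0

    def bin_probs(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        # Additive smoothing keeps empty bins from producing log(0).
        total = len(sample) + eps * bins
        return [(c + eps) / total for c in counts]

    p = bin_probs(p_sample)
    q = bin_probs(q_sample)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy data: one sample from the same distribution, one from a shifted one.
rng = random.Random(0)
real = [rng.gauss(0.0, 1.0) for _ in range(500)]
close = [rng.gauss(0.0, 1.0) for _ in range(500)]
shifted = [rng.gauss(3.0, 1.0) for _ in range(500)]

kl_close = kl_divergence_hist(real, close)
kl_shifted = kl_divergence_hist(real, shifted)
print(kl_close < kl_shifted)
```

Histogram estimates depend strongly on the bin count and become unreliable in higher dimensions, so this sketch only conveys how the metric is read: values near zero indicate closely matching distributions.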

Table 3

MMD comparison of five synthetic sample imputation methods on 10 uniform-based distributions

Dataset	No. of real samples	No. of synthetic samples	DGM-SSI	RSI	SMOTE	SMOTE-D	k-means SMOTE
$f_{11}(x)$	2,000	2,000	0.0143±0.0072	0.2259±0.0106	0.0204±0.0065	0.1486±0.0134	0.2518±0.0104
$f_{12}(x)$	4,000	4,000	0.0098±0.0033	0.2746±0.0043	0.0112±0.0040	0.1149±0.0083	0.3172±0.0110
$f_{13}(x)$	4,000	4,000	0.0114±0.0027	0.4177±0.0030	0.0131±0.0043	0.1009±0.0027	0.1702±0.0033
$f_{14}(x)$	8,000	8,000	0.0080±0.0017	0.3828±0.0016	0.0120±0.0031	0.1719±0.0045	0.4280±0.0492
$f_{15}(x)$	6,000	6,000	0.0093±0.0026	0.4727±0.0010	0.0109±0.0040	0.1343±0.0030	0.4189±0.0076
$f_{16}(x)$	12,000	12,000	0.0078±0.0020	0.3549±0.0007	0.0094±0.0017	0.1177±0.0029	0.5868±0.0044
$f_{17}(x)$	8,000	8,000	0.0094±0.0032	0.4311±0.0005	0.0109±0.0028	0.1454±0.0035	0.3511±0.0096
$f_{18}(x)$	16,000	16,000	0.0064±0.0014	0.5063±0.0003	0.0075±0.0018	0.2097±0.0016	0.5645±0.0019
$f_{19}(x)$	10,000	10,000	0.0067±0.0015	0.6328±0.0004	0.0082±0.0021	0.2030±0.0025	0.7745±0.0070
$f_{20}(x)$	20,000	20,000	0.0063±0.0013	0.3289±0.0006	0.0072±0.0010	0.0783±0.0020	0.3506±0.0299

Table 4

KL divergence comparison of five synthetic sample imputation methods on 10 uniform-based distributions

Dataset	No. of real samples	No. of synthetic samples	DGM-SSI	RSI	SMOTE	SMOTE-D	k-means SMOTE
$f_{11}(x)$	2,000	2,000	0.0296±0.0043	0.0550±0.0082	0.0772±0.0085	3.5050±0.9131	0.2815±0.0241
$f_{12}(x)$	4,000	4,000	0.0178±0.0015	0.0913±0.0066	0.0311±0.0029	0.2320±0.0325	0.8316±0.2047
$f_{13}(x)$	4,000	4,000	0.0788±0.0044	0.1102±0.0358	0.1406±0.0072	1.6283±0.1084	0.2591±0.0405
$f_{14}(x)$	8,000	8,000	0.0626±0.0025	0.2923±0.0431	0.0840±0.0039	1.6410±0.1614	1.0735±0.2652
$f_{15}(x)$	6,000	6,000	0.1471±0.0027	0.2276±0.0470	0.2571±0.0079	1.8822±0.2525	0.7729±0.0434
$f_{16}(x)$	12,000	12,000	0.1504±0.0042	0.2356±0.0774	0.2008±0.0053	1.6906±0.1359	2.8104±0.0819
$f_{17}(x)$	8,000	8,000	0.2224±0.0043	0.3626±0.1601	0.3625±0.0130	2.0855±0.1916	1.2104±0.0326
$f_{18}(x)$	16,000	16,000	0.1889±0.0073	0.3887±0.1388	0.2312±0.0060	2.2768±0.1096	1.9700±0.0684
$f_{19}(x)$	10,000	10,000	0.2541±0.0054	0.4920±0.2803	0.4226±0.0163	2.1045±0.1675	7.5017±2.2461
$f_{20}(x)$	20,000	20,000	0.3392±0.0061	0.4796±0.0559	0.4396±0.0093	1.4180±0.0567	5.2226±3.5511
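For context on the SMOTE baseline that appears in the tables above, the sketch below shows the core SMOTE interpolation step: a new point is placed at a random position on the segment between an existing point and one of its k nearest neighbours. The function name `smote_like`, the value of `k`, and the toy data are hypothetical. Standard SMOTE is applied to a minority class, so running it on a whole sample set, as done here, is an assumption about the comparison setup rather than something the source confirms.

```python
import math
import random


def smote_like(points, n_new, k=5, seed=0):
    """Create n_new points by SMOTE-style interpolation.

    Each new point lies between a randomly chosen base point and one of
    its k nearest neighbours (brute-force Euclidean search).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        # Nearest neighbours of the base point, excluding the point itself.
        neighbours = sorted(
            (p for p in points if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + lam * (m - b) for b, m in zip(base, mate)))
    return synthetic


# Toy 2-D data standing in for real samples.
rng = random.Random(1)
points = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(50)]

new_points = smote_like(points, n_new=100)
print(len(new_points))
```

Since every generated point is a convex combination of two existing points, interpolation of this kind never leaves the region spanned by the data and tends to concentrate mass along segments between neighbours, which gives some intuition for why such methods can drift from the real distribution.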

Figs. 9-10

