Journal of System Simulation ›› 2023, Vol. 35 ›› Issue (9): 1948-1964. doi: 10.16182/j.issn1004731x.joss.22-0554


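Data Generation Model-based Synthetic Sample Imputation Method

He Yulin1,2, Chen Jiaqi2, Xu Hepeng2, Huang Zhexue1,2, Yin Jianfei2

  1. Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen 518107, Guangdong, China
  2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong, China
  • Received: 2022-05-25 Revised: 2022-07-28 Online: 2023-09-25 Published: 2023-09-19
  • About the first author: He Yulin (born 1982), male, associate research fellow, Ph.D. Research interests: approximate computing for big data, multi-sample statistical analysis theory, and data mining and machine learning algorithms and their applications. E-mail: yulinhe@gml.ac.cn
  • Funding:
    General Program of the National Natural Science Foundation of China (61972261); Shenzhen Basic Research Key Project (JCYJ20220818100205012); Shenzhen Basic Research Project (JCYJ20210324093609026)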

Data Generation Model-based Synthetic Sample Imputation Method

He Yulin1,2, Chen Jiaqi2, Xu Hepeng2, Huang Zhexue1,2, Yin Jianfei2

  1. Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen 518107, China
  2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
  • Received: 2022-05-25 Revised: 2022-07-28 Online: 2023-09-25 Published: 2023-09-19

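Abstract:

To address the inconsistency between the probability distributions of imputed synthetic samples and real samples, a synthetic sample imputation method based on a data generation model is proposed. A data generation model of the real samples is built on a Gaussian mixture model, whose number of components is determined by a multi-model fusion strategy. The data generation model obtained from the real samples is then used to impute the required synthetic samples, with the components of the model and their weights controlling how the synthetic samples are generated. The feasibility and effectiveness of the new method are verified on 20 multimodal, multidimensional mixture distributions. The experimental results show that, compared with four methods, namely random sample imputation, the synthetic minority over-sampling technique, and its two latest variants, the proposed method yields synthetic samples whose probability distribution is more consistent with that of the real samples, confirming that it is a reasonable synthetic sample imputation method.

Key words: synthetic sample imputation, data generation model, Gaussian mixture model, synthetic minority over-sampling technique, probability distribution consistency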

Abstract:

To address the inconsistency between the probability distributions of imputed synthetic samples and real samples, a data generation model-based synthetic sample imputation (DGM-SSI) method is proposed. The data generation model of the real samples is built on a Gaussian mixture model, whose number of components is determined by a multi-model fusion strategy. The required synthetic samples are then imputed with the data generation model obtained from the real samples; specifically, the components of the model and their weights control how the synthetic samples are generated. The feasibility and effectiveness of DGM-SSI are verified on 20 multimodal, multidimensional mixture distributions. The experimental results show that, compared with four methods, namely random sample imputation, the synthetic minority over-sampling technique (SMOTE), and its two latest variants, the proposed method yields synthetic samples whose probability distribution is more consistent with that of the real samples, confirming that it is a reasonable synthetic sample imputation method.

Key words: synthetic sample imputation, data generation model, Gaussian mixture model, synthetic minority over-sampling technique, probability distribution consistency
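
The generation step summarized in the abstract, in which the mixture components and their weights control how synthetic samples are drawn, can be illustrated with a minimal sketch. It is not the authors' DGM-SSI implementation: it assumes a one-dimensional Gaussian mixture whose parameters are already known, and the function name `sample_from_mixture` and all parameter values are hypothetical. Fitting the mixture and choosing the number of components by multi-model fusion are not shown.

```python
import random


def sample_from_mixture(weights, means, stds, n, seed=0):
    """Draw n synthetic 1-D points from a Gaussian mixture.

    Each point first picks a component with probability equal to that
    component's weight, then draws from the component's normal distribution.
    """
    rng = random.Random(seed)
    # Component index for every synthetic point, chosen by mixture weight.
    picks = rng.choices(range(len(weights)), weights=weights, k=n)
    return [rng.gauss(means[k], stds[k]) for k in picks]


# Hypothetical two-component model standing in for one fitted to real samples.
weights = [0.3, 0.7]
means = [-2.0, 3.0]
stds = [0.5, 1.0]

synthetic = sample_from_mixture(weights, means, stds, n=10_000)

# Fraction of points falling around the first mode; it should sit close to
# the first component's weight, since the two modes barely overlap.
share_low = sum(x < 0.5 for x in synthetic) / len(synthetic)
print(f"share of points near the first mode: {share_low:.3f}")
```

Because each component is selected in proportion to its weight, the synthetic points reproduce the relative mass of every mode. Interpolation-based schemes such as SMOTE do not enforce this property by construction.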

CLC number: