系统仿真学报 ›› 2020, Vol. 32 ›› Issue (6): 1038-1050.doi: 10.16182/j.issn1004731x.joss.18-0722

• 仿真建模理论与方法 • 上一篇    下一篇

基于Spark的并行图聚类算法研究

刘东江1,2, 黎建辉1   

  1. 1. 中国科学院计算机网络信息中心,北京 100190;
    2. 中国科学院大学,北京 100190
  • 收稿日期:2018-10-29 修回日期:2019-03-02 发布日期:2020-06-25
  • 作者简介:刘东江(1988-),男,内蒙古,博士,研究方向为数据挖掘、机器学习;黎建辉(1973-),男,湖北,博士,研究员,研究方向为数据密集型计算、数据密集型存储。
  • 基金资助:
    国家重点研发计划(2016YFB1000600),中国科学院战略性先导科技专项(XDA06010307)

Study of Parallelized Graph Clustering Algorithm Based on Spark

Liu Dongjiang1,2, Li Jianhui1   

  1. 1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China;
    2. University of Chinese Academy of Sciences, Beijing 100190, China
  • Received:2018-10-29 Revised:2019-03-02 Published:2020-06-25

摘要: 对并行图聚类算法进行了研究。基于Spark 提出了一个新的并行图聚类算法;由于Spark 中的top 操作需要耗费大量的内存,提出了一个新算法来替代top 操作,有效减少了所消耗的内存;通过对自底向上的层次聚类算法进行改进提高了聚类的速度;基于图数据的特征提出了一种图数据过滤的方法来减少算法运行的时间以及所占用的空间并对其有效性进行了说明。仿真结果表明,运行效果优于进行比较的其他并行化图聚类算法。

关键词: 图聚类, 图数据, Spark, 算法, 并行化

Abstract: The parallelized graph clustering algorithm is researched. A new parallelized graph clustering algorithm is proposed based on Spark. As the top operation of Spark occupies a lot of memory space, a new algorithm which is used to substitute the top operation is proposed to reduce the memory consumption. By improving bottom up hierarchical clustering algorithm, the speed of the proposed algorithm is improved. A new data filtering method based on the feature of graph data is proposed. By the method, the running time and memory space comsuption is reduced greatly. The reason of the high efficiency of this filtering method is explained. Simulation result indicates that the proposed algorithm is better than other parallelized graph clustering algorithms.

Key words: graph clustering, graph data, Spark, algorithm, parallelize

中图分类号: