Study of Parallelized Graph Clustering Algorithm Based on Spark

Journal of System Simulation, 2020, Vol. 32, Issue 6: 1038-1050. doi: 10.16182/j.issn1004731x.joss.18-0722

Liu Dongjiang^1,2, Li Jianhui^1

1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China;
2. University of Chinese Academy of Sciences, Beijing 100190, China

Received: 2018-10-29 Revised: 2019-03-02 Published: 2020-06-25
About the authors: Liu Dongjiang (b. 1988), male, from Inner Mongolia, Ph.D.; research interests: data mining and machine learning. Li Jianhui (b. 1973), male, from Hubei, Ph.D., research professor; research interests: data-intensive computing and data-intensive storage.
Funding: National Key Research and Development Program of China (2016YFB1000600); Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06010307)

Abstract

Abstract: This paper studies parallel graph clustering. A new parallel graph clustering algorithm based on Spark is proposed. Because the top operation in Spark consumes a large amount of memory, a new algorithm is proposed to replace it, which effectively reduces memory consumption. Clustering speed is increased by improving the bottom-up hierarchical clustering algorithm. A graph data filtering method based on the characteristics of graph data is proposed to reduce both the running time and the memory footprint of the algorithm, and the reason for its effectiveness is explained. Simulation results show that the proposed algorithm outperforms the other parallel graph clustering algorithms it was compared with.

Key words: graph clustering, graph data, Spark, algorithm, parallelization
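
The abstract does not describe the algorithm that replaces the top operation, so the sketch below is not the authors' method. It only illustrates one generic way to select the k largest items while holding at most k items per partition in memory, using a bounded min-heap. The edge layout `(score, source, target)`, the function names, and the sample partitions are all hypothetical, and plain Python lists stand in for Spark partitions.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical edge layout: (score, source vertex, target vertex).
Edge = Tuple[float, str, str]


def top_k_in_partition(edges: Iterable[Edge], k: int) -> List[Edge]:
    """Keep only the k highest-scoring edges of one partition with a size-k min-heap."""
    if k <= 0:
        return []
    heap: List[Edge] = []
    for edge in edges:
        if len(heap) < k:
            heapq.heappush(heap, edge)
        elif edge > heap[0]:
            # The new edge beats the smallest kept edge, so replace it.
            heapq.heapreplace(heap, edge)
    return heap


def top_k(partitions: Iterable[Iterable[Edge]], k: int) -> List[Edge]:
    """Merge per-partition candidates; each partition contributes at most k edges."""
    candidates: List[Edge] = []
    for part in partitions:
        candidates.extend(top_k_in_partition(part, k))
    # Final selection over the small candidate set, largest score first.
    return heapq.nlargest(k, candidates)


# Assumed sample data: two partitions of scored edges.
partitions = [
    [(0.9, "a", "b"), (0.1, "a", "c"), (0.4, "b", "c")],
    [(0.7, "c", "d"), (0.8, "d", "e")],
]
result = top_k(partitions, 2)
```

With the sample data, `result` holds the two highest-scoring edges in descending order. The point of the design is that the per-partition working set never grows beyond k entries, regardless of partition size.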

CLC number: TP391

Liu Dongjiang, Li Jianhui. Study of Parallelized Graph Clustering Algorithm Based on Spark[J]. Journal of System Simulation, 2020, 32(6): 1038-1050.
