Journal of System Simulation ›› 2024, Vol. 36 ›› Issue (9): 2208-2218. doi: 10.16182/j.issn1004731x.joss.23-0584

• Research Paper •

Research on Autonomous Decision-making in Air-combat Based on Improved Proximal Policy Optimization

Qian Dianwei1, Qi Hongmin1, Liu Zhen2, Zhou Zhiming2, Yi Jianqiang2

  1. School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China
  2. Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2023-05-18 Revised: 2023-06-16 Online: 2024-09-15 Published: 2024-09-30
  • Contact: Zhou Zhiming
  • About the first author: Qian Dianwei (1980-), male, associate professor, Ph.D.; research interests: artificial intelligence technology and robotic systems.

Abstract:

To address the high information redundancy and slow convergence of traditional reinforcement learning in autonomous air-combat decision-making, an autonomous air-combat decision-making algorithm based on proximal policy optimization with dual observation and composite reward is proposed. A dual observation space, with interaction information as the primary component and individual feature information as a supplement, is designed to reduce the impact of highly redundant battlefield information on training efficiency. A composite reward function combining result rewards and process rewards is designed to speed up convergence during training. Generalized advantage estimation is adopted to improve the proximal policy optimization algorithm and increase the accuracy of advantage function estimation. Simulation results show that, in two experimental scenarios, against a fixed program-controlled opponent and against a matrix-game opponent, the resulting decision model makes accurate autonomous decisions according to the battlefield situation and completes the air-combat task.
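
As a rough illustration of the generalized advantage estimator named in the abstract, the sketch below computes advantages for a single trajectory using the standard backward recursion over one-step TD residuals. It is not taken from the paper: the function name `gae_advantages`, the argument names, and the default values of `gamma` and `lam` are assumptions chosen for illustration, and the trajectory is assumed to contain no termination before its last step.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float = 0.0,
    gamma: float = 0.99,  # discount factor (illustrative default, not from the paper)
    lam: float = 0.95,    # GAE smoothing parameter (illustrative default)
) -> List[float]:
    """Generalized advantage estimates for one trajectory.

    values[t] is the critic's estimate V(s_t); last_value is the estimate for
    the state reached after the final step (0.0 if the episode ended there).
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")

    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk backwards so each step reuses the estimate accumulated for the next one.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    # Toy two-step trajectory with made-up rewards and value estimates.
    print(gae_advantages([1.0, 1.0], [0.5, 0.5], gamma=1.0, lam=1.0))
```

With `lam=0.0` the estimate reduces to the one-step TD residual, and with `lam=1.0` it becomes the discounted return minus the value baseline; intermediate values trade bias against variance.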

Key words: reinforcement learning, air-combat autonomous decision-making, dual observation, composite reward, generalized advantage estimator

CLC number: