Research on Proximal Policy Optimization for Autonomous Long-Distance Rapid Rendezvous of Spacecraft
Abstract: Considering the effect of the Earth-oblateness J2 perturbation, this paper addresses the minimum-fuel trajectory optimization problem for long-distance rapid transfer between non-coplanar orbits under limited onboard fuel and a limited transfer time. Proximal Policy Optimization (PPO) is used to design the duration and velocity increment of each impulsive maneuver, yielding a transfer trajectory with minimum fuel consumption. First, the dynamics model of spacecraft orbital transfer under J2 perturbation is constructed, and the uncertainties encountered during on-orbit operation are analyzed. Second, the problem is recast as an optimal control problem and a reinforcement-learning training framework is established. A suitable reward function based on process and terminal constraints is then designed to improve the exploration capability of the algorithm and the stability of training. Finally, a policy model is obtained by training within this framework to generate orbital-maneuver strategies, and the performance of the algorithm is verified through simulations and comparative experiments. Compared with existing DRL methods, the improved dense reward function proposed here, which combines a position potential function with a velocity-guidance mechanism, significantly improves convergence speed, robustness, and fuel optimality; simulation results show that the method generates effective policies and meets the expected approach requirements.
Abstract:
Objective  With the increasing demands of deep-space exploration, on-orbit servicing, and space-debris removal missions, autonomous long-range rapid rendezvous has become a critical capability for future space operations. Traditional trajectory-planning approaches based on analytical methods or heuristic optimization often exhibit limitations when dealing with complex dynamics, strong disturbances, and uncertainties, making it difficult to balance efficiency and robustness. Deep Reinforcement Learning (DRL), by combining the approximation capability of deep neural networks with the decision-making strengths of reinforcement learning, enables adaptive learning and real-time decision-making in high-dimensional continuous state and action spaces. In particular, the Proximal Policy Optimization (PPO) algorithm, with its training stability, sample efficiency, and ease of implementation, has emerged as a representative policy-gradient method that enhances policy exploration while ensuring stable policy updates. Applying PPO-based DRL to spacecraft long-range rapid rendezvous can therefore overcome the limitations of conventional methods and provide an intelligent, efficient, and robust solution for autonomous guidance in complex orbital environments.
Methods  This study first establishes a spacecraft orbital dynamics model incorporating the J2 perturbation, while also modeling uncertainties such as position and velocity measurement errors and actuator deviations during on-orbit operation. The long-range rapid rendezvous problem is then formulated as a Markov Decision Process (MDP), with the state space defined by position, velocity, and relative distance, and the action space characterized by impulse duration and direction; fuel consumption and terminal position and velocity constraints are incorporated into the formulation. On this basis, a PPO-based DRL framework is constructed, in which the policy network outputs maneuver-command distributions and the value network estimates state values to improve training stability. To address the convergence difficulties caused by sparse rewards, an enhanced dense reward function is designed that combines a position potential function with a velocity-guidance function, guiding the agent toward the target while gradually decelerating and preserving fuel efficiency (minimal illustrative sketches of the J2 dynamics and the shaped reward are given after the abstract). Finally, the optimal maneuver strategy is obtained through simulation-based training, and its robustness is validated under various uncertainty conditions.
Results and Discussions  A comprehensive simulation study was conducted within this DRL framework to evaluate the effectiveness and robustness of the proposed improved algorithm. In Case 1, three reward structures were tested: a sparse reward, a traditional dense reward, and the improved dense reward integrating a relative-position potential function and a velocity-guidance term. The results indicate that the reward design significantly affects convergence behavior and policy stability. With the sparse reward, the agent lacks process feedback, which hinders effective exploration of feasible actions. The traditional dense reward provides continuous feedback and allows gradual convergence toward local optima, but terminal velocity deviations remain uncorrected in the later stages, leading to suboptimal convergence and incomplete satisfaction of the terminal constraints.
In contrast, the improved dense reward guides the agent toward favorable behaviors from the early training stages while penalizing undesirable actions at each step, thereby accelerating convergence and enhancing robustness. The velocity-guidance term enables the agent to anticipate the necessary adjustments during the mid-to-late phases of the approach rather than postponing corrections to the terminal phase, resulting in more fuel-efficient maneuvers. The simulation results illustrate the achieved performance: the maneuvering spacecraft executed 10 impulsive maneuvers over the mission, reaching a terminal relative distance of 21.326 km and a relative velocity of 0.0050 km/s, with a total fuel consumption of 111.2123 kg. Furthermore, to validate the robustness of the trained model against realistic uncertainties in orbital operations, 1000 Monte Carlo simulations were performed. As presented in Table 6, the mission success rate reached 63.40%, with fuel consumption in all trials remaining within acceptable bounds. Finally, to verify the advantage of the PPO algorithm, its performance was compared with that of DDPG in a multi-impulse fast-approach rendezvous mission (Case 2). With PPO, the maneuvering spacecraft performed 5 impulsive maneuvers, achieving a terminal separation of 2.2818 km, a relative velocity of 0.0038 km/s, and a total fuel consumption of 4.1486 kg. With DDPG, the spacecraft consumed 4.3225 kg of fuel, achieving a final separation of 4.2731 km and a relative velocity of 0.0020 km/s. Both algorithms fulfill the mission requirements with comparable fuel usage; however, DDPG required 9 hours and 23 minutes of training and considerable computational resources, whereas PPO converged within 6 hours and 4 minutes. Although DDPG exhibits higher sample efficiency, its longer training cycle and heavier computational burden make it less efficient than PPO in this application. The comparative analysis demonstrates that the proposed PPO with the improved dense reward significantly enhances learning efficiency, policy stability, and robustness.
Conclusions  This study addressed autonomous long-range rapid rendezvous of spacecraft under J2 perturbation and uncertainties and proposed a PPO-based trajectory-optimization method. The results demonstrate that the proposed approach generates maneuver trajectories satisfying the terminal constraints under limited fuel and transfer time, while outperforming conventional methods in convergence speed, fuel efficiency, and robustness. The main contributions of this work are: (1) the development of an orbital dynamics framework incorporating J2 perturbation and uncertainty modeling, and the formulation of the rendezvous problem as an MDP; (2) the design of an enhanced dense reward function combining a position potential function and a velocity-guidance function, which effectively improves training stability and convergence efficiency; (3) simulation-based validation of the applicability and robustness of PPO in complex orbital environments, providing a feasible solution for future autonomous rendezvous and on-orbit servicing missions.
Future work will consider sensor noise, environmental disturbances, and multi-spacecraft cooperative rendezvous in complex mission scenarios, aiming to enhance the algorithm’s practical applicability and generalization to real-world operations.
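As a reading aid, the following is a minimal sketch of a J2-perturbed two-body propagator of the kind described in the Methods section; the physical constants, state layout, and the illustrative initial condition are assumptions for this sketch, not the authors' implementation.

```python
# Minimal sketch (assumed, not the paper's code): two-body dynamics with the
# J2 zonal-harmonic acceleration in an Earth-centered inertial (ECI) frame.
import numpy as np
from scipy.integrate import solve_ivp

MU = 398600.4418   # Earth gravitational parameter [km^3/s^2]
RE = 6378.137      # Earth equatorial radius [km]
J2 = 1.08263e-3    # second zonal harmonic coefficient (dimensionless)

def j2_two_body(t, state):
    """state = [x, y, z, vx, vy, vz] in the ECI frame, km and km/s."""
    r_vec, v_vec = state[:3], state[3:]
    x, y, z = r_vec
    r = np.linalg.norm(r_vec)
    a_kepler = -MU * r_vec / r**3                 # point-mass gravity
    k = -1.5 * J2 * MU * RE**2 / r**5             # J2 perturbation factor
    a_j2 = k * np.array([x * (1 - 5 * z**2 / r**2),
                         y * (1 - 5 * z**2 / r**2),
                         z * (3 - 5 * z**2 / r**2)])
    return np.concatenate([v_vec, a_kepler + a_j2])

# Illustrative one-hour propagation from a rough perigee state of the Case 1
# chaser orbit in Table 3 (a ~ 30378 km, e = 0.05, i = 25.5 deg), rounded.
state0 = np.array([28859.2, 0.0, 0.0, 0.0, 3.44, 1.64])
sol = solve_ivp(j2_two_body, (0.0, 3600.0), state0, rtol=1e-9, atol=1e-12)
```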
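The improved dense reward can likewise be illustrated with a short sketch. The potential-based shaping term and the line-of-sight velocity decomposition follow the description in the abstract, but the exact functional form and the roles assigned to the Table 4 coefficients ($ {c}_{\text{PBRS}} $, $ {c}_{\parallel } $, $ {c}_{\bot } $, $ {c}_{1} $) are assumptions of this sketch.

```python
# Assumed per-step shaped reward: position potential (PBRS) + velocity
# guidance + fuel penalty. r_rel/v_rel are the target position and velocity
# relative to the chaser; r_rel_next is the relative position after the step.
import numpy as np

def shaped_reward(r_rel, v_rel, r_rel_next, fuel_used,
                  gamma=0.99, c_pbrs=1e-3, c_par=0.18, c_perp=0.18, c_fuel=0.14):
    d, d_next = np.linalg.norm(r_rel), np.linalg.norm(r_rel_next)

    # Potential-based shaping with Phi(s) = -c_pbrs * range: the term
    # gamma*Phi(s') - Phi(s) rewards any step that closes the range while
    # leaving the optimal policy unchanged.
    shaping = gamma * (-c_pbrs * d_next) - (-c_pbrs * d)

    # Velocity guidance: reward the closing speed along the line of sight and
    # penalize the transverse component, so deceleration happens gradually
    # instead of being pushed to the terminal phase.
    los = r_rel / (d + 1e-8)                  # unit line-of-sight vector
    v_close = -np.dot(v_rel, los)             # > 0 when approaching the target
    v_perp = np.linalg.norm(v_rel + v_close * los)
    guidance = c_par * v_close - c_perp * v_perp

    # Fuel term discourages unnecessarily large impulses.
    return shaping + guidance - c_fuel * fuel_used
```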
Table 1  Relative magnitude of each perturbation (area-to-mass ratio $ 0.01{\mathrm{m}}^{2}/\text{kg} $)
Perturbation term | Low orbit | Medium orbit | High orbit
Non-spherical gravity, J2 term | $10^{-3}$ | $10^{-4}$ | $10^{-4}$
Non-spherical gravity, other terms | $10^{-7}$ | $10^{-7}$ | $10^{-7}$
Solar gravity | $10^{-8}$ | $10^{-6}$ | $10^{-6}$
Lunar gravity | $10^{-7}$ | $10^{-6}$ | $10^{-5}$
Atmospheric drag | $10^{-6}$–$10^{-10}$ | $<10^{-10}$ | $<10^{-10}$
Solar radiation pressure | $10^{-8}$ | $10^{-7}$ | $10^{-7}$
Tides | $10^{-8}$ | $10^{-9}$ | $10^{-10}$
Table 2  Performance parameters of the maneuvering spacecraft
Performance parameter | Case 1 | Case 2
$ {{{m}_{\mathcal{F}}}}_{0} $ [kg] | 150 | 10
$ {m}_{0} $ [kg] | 250 | 20
$ \dot{m} $ [kg/s] | 1.6 | 0.5
$ {I}_{\text{sp}} $ [s] | 400 | 400
$ \Delta {{{t}_{\mathcal{F}}}}_{\max } $ [s] | 20 | 5
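As a quick plausibility check on the Case 1 parameters above, the rocket equation gives the propellant mass and velocity increment of a single maximum-duration burn; the calculation below is illustrative and not taken from the paper.

```python
# Illustrative use of Table 2 (Case 1): fuel and delta-v of one maximal burn.
import math

g0 = 9.80665                                      # standard gravity [m/s^2]
m0, mdot, isp, dt_max = 250.0, 1.6, 400.0, 20.0   # Table 2, Case 1

fuel_per_burn = mdot * dt_max                     # 32 kg of propellant
dv = isp * g0 * math.log(m0 / (m0 - fuel_per_burn))
print(f"max fuel per impulse: {fuel_per_burn:.1f} kg, max dv: {dv:.1f} m/s")
```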
Table 3  Initial orbital elements of the spacecraft
Orbital element | Case 1 maneuvering spacecraft | Case 1 target spacecraft | Case 2 maneuvering spacecraft | Case 2 target spacecraft
Semi-major axis [km] | 30378.1363 | 40378.1363 | 27978.1363 | 28378.1363
Eccentricity | 0.05 | 0.01 | 0.05 | 0.01
Inclination [°] | 25.5 | 18 | 30 | 31
Right ascension of the ascending node [°] | 14.4 | 352.8 | 0 | 5
Argument of perigee [°] | 14.4 | 14.4 | 10 | 10
True anomaly [°] | 7.2 | 7 | 12 | 18
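For readers who want to reproduce the setup, the elements in Table 3 can be converted to a Cartesian ECI state with the standard perifocal-to-ECI rotation; the helper below is an illustrative sketch, not code from the paper.

```python
# Assumed helper: classical orbital elements -> Cartesian ECI state, usable
# e.g. to initialize the J2 propagator sketched after the abstract.
import numpy as np

MU = 398600.4418  # [km^3/s^2]

def rotz(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rotx(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def elements_to_state(a, e, inc, raan, argp, nu):
    """Classical elements (angles in degrees) -> position [km], velocity [km/s]."""
    inc, raan, argp, nu = np.radians([inc, raan, argp, nu])
    p = a * (1.0 - e**2)
    r = p / (1.0 + e * np.cos(nu))
    r_pqw = r * np.array([np.cos(nu), np.sin(nu), 0.0])
    v_pqw = np.sqrt(MU / p) * np.array([-np.sin(nu), e + np.cos(nu), 0.0])
    Q = rotz(raan) @ rotx(inc) @ rotz(argp)   # perifocal -> ECI rotation
    return Q @ r_pqw, Q @ v_pqw

# Case 1 maneuvering spacecraft from Table 3
r0, v0 = elements_to_state(30378.1363, 0.05, 25.5, 14.4, 14.4, 7.2)
```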
Table 4  Hyperparameter settings of the algorithm
Hyperparameter | Case 1 value | Case 2 value
$ {L}_{\text{ub}} $/$ {L}_{\text{lb}} $ [km] | 10000/45000 | 300/4000
$ {\rho }_{1} $/$ {\rho }_{2} $ | 3e-2/3e-2 | 3e-2/3e-2
$ \tau $ [km] | 1e-4 | 1e-4
$ {d}_{\mathrm{c}} $ [km] | 500 | 100
$ {N}_{\mathrm{g}} $ | 2 | 2
$ {c}_{1} $/$ {c}_{2} $/$ {c}_{3} $/$ {c}_{4} $/$ {c}_{5} $/$ {c}_{\text{PBRS}} $/$ {c}_{\parallel } $/$ {c}_{\bot } $ | 0.14/0.4/0.4/0.2/0.2/1e-3/0.18/0.18 | 0.14/0.4/0.4/2/2/1e-3/0.18/0.18
$ \gamma $ | 0.99 | 0.9
Learning Rate | 1e-5 | 1e-5
Episode | 6e5 | 2e5
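Purely as an illustration, the discount factor, learning rate, and training budget of Table 4 could be wired into an off-the-shelf PPO implementation as follows; the use of Stable-Baselines3 and the stand-in Gymnasium task are assumptions, since the paper's environment and training code are not reproduced here.

```python
# Assumed mapping of Table 4 hyperparameters onto a generic PPO trainer.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")       # placeholder for the rendezvous MDP

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=1e-5,             # Table 4: Learning Rate
    gamma=0.99,                     # Table 4: gamma (Case 1)
    verbose=1,
)
# Table 4 quotes a budget of 6e5 episodes for Case 1; Stable-Baselines3 counts
# environment steps rather than episodes, so this budget is only illustrative.
model.learn(total_timesteps=600_000)
```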
Table 5  Gaussian noise models of the orbital uncertainties
Uncertainty term | Mean | Standard deviation
$ \delta {\boldsymbol{r}}_{s,i} $ | 0 km | 1 km
$ \delta {\boldsymbol{v}}_{s,i} $ | 0 km/s | 0.02 km/s
$ \delta {\boldsymbol{r}}_{m,i} $ | 0 km | 1 km
$ \delta {\boldsymbol{v}}_{m,i} $ | 0 km/s | 0.02 km/s
$ \delta {{{v}_{x}}}_{v,i} $/$ \delta {{{v}_{y}}}_{v,i} $/$ \delta {{{v}_{z}}}_{v,i} $ | 0 | 0.006
Table 6  Robustness test results
Test metric | $ {\mathcal{N}}_{s,i} $ | $ {\mathcal{N}}_{m,i} $ | $ {\mathcal{N}}_{v,i} $ | $ {\mathcal{N}}_{s,i}+{\mathcal{N}}_{m,i}+{\mathcal{N}}_{v,i} $
Success rate | 89.30% | 87.30% | 71.90% | 63.40%
Fuel consumption, mean [kg] | 111.7205 | 111.5226 | 110.9310 | 111.9222
Fuel consumption, standard deviation [kg] | 0.6441 | 0.5008 | 0.0782 | 0.8903
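A Monte Carlo robustness test of the kind summarized in Table 6 can be organized as sketched below, drawing the noise terms with the Table 5 statistics; the rollout stub and the success tolerances are placeholders rather than the paper's values.

```python
# Assumed Monte Carlo harness: inject Table 5 Gaussian noise, roll out the
# trained policy, and count the fraction of runs meeting assumed tolerances.
import numpy as np

rng = np.random.default_rng(0)

def sample_uncertainty():
    """One realization of the Table 5 noise terms (zero-mean Gaussian)."""
    return {
        "dr_s": rng.normal(0.0, 1.0, 3),       # target position noise [km]
        "dv_s": rng.normal(0.0, 0.02, 3),      # target velocity noise [km/s]
        "dr_m": rng.normal(0.0, 1.0, 3),       # chaser position noise [km]
        "dv_m": rng.normal(0.0, 0.02, 3),      # chaser velocity noise [km/s]
        "dv_exec": rng.normal(0.0, 0.006, 3),  # impulse execution error
    }

def rollout_episode(noise):
    """Dummy stand-in: replace with the trained PPO policy driving the simulator."""
    dist = 21.3 + np.linalg.norm(noise["dr_m"])      # terminal range [km]
    speed = 0.005 + np.abs(noise["dv_exec"]).mean()  # terminal speed [km/s]
    fuel = 111.2 + float(rng.normal(0.0, 0.8))       # fuel consumed [kg]
    return dist, speed, fuel

successes, fuel_log = 0, []
for _ in range(1000):
    dist, speed, fuel = rollout_episode(sample_uncertainty())
    fuel_log.append(fuel)
    if dist <= 25.0 and speed <= 0.01:   # assumed tolerances, not from the paper
        successes += 1

print(f"success rate: {successes/1000:.2%}, "
      f"fuel mean/std: {np.mean(fuel_log):.4f}/{np.std(fuel_log):.4f} kg")
```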