Gain-Risk Control Framework for Stabilization of Deep Policy Gradient Optimization

Abstract

Deep policy gradient optimization aims to maximize the expected objective value through policy updates based on gradient ascent. However, policy updates at individual samples of a trajectory do not necessarily guarantee that a policy parameterized by a neural network improves at all states, which can result in permanent performance degradation or trapping at local optima. One cause of these phenomena is that the policy is updated using the sample mean of the policy gradient objective, which produces a mix of positive and negative updates at the sampled states when a neural network policy is used. The nature of neural network policy parameterization itself induces this behavior and leads to lower performance, which we show through a minimal toy example. To address this issue, we propose a gain-risk framework for analyzing policy updates at individual states in deep policy gradient optimization, where the gain of a state is defined as the expected increase in the objective value of that state, and the risk of a state as the expected loss of the objective value after the policy update. This framework provides a new viewpoint on the policy update step, in which maximizing the expected objective value is seen as a decision to maximize gains while taking risks. This viewpoint allows risk control strategies to be deployed; we develop regularization methods that control the minimum risk and the maximum absolute log-policy ratio in a sample batch of the trajectory, thereby stabilizing the policy update step. We also propose an adaptive regularization method to ease hyperparameter tuning and control of the gain-risk trade-off strategy. Simulation results show that the proposed regularization methods yield stable and gradual training, leading to a large improvement in final performance on continuous control and locomotion tasks.
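
The gain-risk viewpoint in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's exact definitions, so everything here is an assumption made for illustration: the per-sample change is taken to be the importance-weighted surrogate `(ratio - 1) * advantage`, gain and risk are taken to be the batch averages of its positive and negative parts, and the function name `gain_risk_summary` and its arguments are hypothetical.

```python
import math


def gain_risk_summary(logp_old, logp_new, advantages):
    """Split a batch's surrogate objective change into gain and risk.

    Illustrative assumptions (not the thesis's definitions):
      - per-sample change = (exp(logp_new - logp_old) - 1) * advantage
      - gain = batch average of the positive parts of that change
      - risk = batch average of the negative parts (reported as a positive number)
    """
    n = len(advantages)
    log_ratios = [new - old for new, old in zip(logp_new, logp_old)]
    # Per-sample change of the assumed surrogate objective.
    deltas = [(math.exp(lr) - 1.0) * adv for lr, adv in zip(log_ratios, advantages)]
    gain = sum(max(d, 0.0) for d in deltas) / n
    risk = sum(max(-d, 0.0) for d in deltas) / n
    return {
        "mean_change": sum(deltas) / n,  # what a sample-mean objective sees
        "gain": gain,
        "risk": risk,
        # Largest absolute log-policy ratio in the batch, the second quantity
        # the abstract says the regularizers control.
        "max_abs_log_ratio": max(abs(lr) for lr in log_ratios),
    }


# Hypothetical batch: one sample improves, one degrades, one is unchanged.
summary = gain_risk_summary(
    logp_old=[-1.0, -1.0, -1.0],
    logp_new=[-0.5, -1.5, -1.0],
    advantages=[1.0, 1.0, -2.0],
)
print(summary)
```

Under these assumptions the sample mean equals gain minus risk, so a positive mean can hide a nonzero risk at individual samples. A regularizer in the spirit of the abstract would penalize the risk term or the maximum absolute log-policy ratio, though the form of that penalty is not specified here.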

Table of Contents

Chapter 1. Introduction
1.1 Preliminaries
1.2 Related Works
Chapter 2. Stability of Policy Gradient Algorithms
2.1 Minimal Example
Chapter 3. Gain-Risk Framework
3.1 Gain-Risk Definition
3.1.1 Gain
3.1.2 Risk
3.2 Gain-Risk Trade-off
3.3 Risk Regularization
3.4 Adaptive Regularization
Chapter 4. Experiments
4.1 Environments and Implementation Details
4.2 Policy Gradient Method through Gain-Risk Framework
4.3 Effect of Risk Regularization
4.4 Performance of Proposed Regularization Methods
4.5 Hyperparameters
Chapter 5. Conclusion
Bibliography
