2025-06-20
NLP

Raise the Ceiling: Clip-Higher

Cited from: DAPO: An Open-Source LLM Reinforcement Learning System at Scale

In our initial experiments using naive PPO or GRPO, we observed the entropy collapse phenomenon: the entropy of the policy decreases quickly as training progresses. The sampled responses of certain groups tend to be nearly identical. This indicates limited exploration and an early deterministic policy, which can hinder the scaling process.

We propose the Clip-Higher strategy to address this issue. Clipping of the importance sampling ratio is introduced in Clipped Proximal Policy Optimization (PPO-Clip) to restrict the trust region and enhance the stability of RL. We identify that the upper clip can restrict the exploration of the policy: making an 'exploitation' token more probable is much easier, whereas the probability of an unlikely 'exploration' token is too tightly bounded to be uplifted.
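
For concreteness, the single-ε PPO-Clip surrogate that the quoted passage refers to can be sketched roughly as follows. This is a minimal PyTorch sketch with my own function and tensor names, not code from the paper:

```python
import torch

def ppo_clip_surrogate(logp_new: torch.Tensor,
                       logp_old: torch.Tensor,
                       advantages: torch.Tensor,
                       eps: float = 0.2) -> torch.Tensor:
    """Per-token PPO-Clip objective with one symmetric clipping range eps."""
    ratio = torch.exp(logp_new - logp_old)                    # importance sampling ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # When the advantage is positive, the (1 + eps) bound is the "upper clip"
    # that limits how far a token's probability can be pushed up in one update.
    return torch.min(unclipped, clipped).mean()
```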

Concretely, when $\varepsilon = 0.2$ (the default value of most algorithms) and $\hat{A}_{i,t} > 0$ (the system tries to increase the probability), consider two actions with probabilities $\pi_{\theta_{\text{old}}}(o_i \mid q) = 0.01$ and $0.9$. The upper bounds of the increased probabilities $\pi_\theta(o_i \mid q)$ are $0.012$ and $1.08$, respectively ($\pi_{\theta_{\text{old}}} \cdot (1 + \varepsilon)$). This implies that 'exploitation' tokens that already have a high probability (e.g., $0.9$) are effectively unconstrained and can reach even larger probabilities such as $0.999$. Conversely, for low-probability 'exploration' tokens, achieving a non-trivial increase in probability is considerably more challenging. Empirically, we also observe that the mean probability of up-clipped tokens is low: $\pi_\theta(o_i \mid q) < 0.2$ (Figure 3a). This finding supports our intuition that the upper clipping threshold indeed restricts the probability increase of low-probability 'exploration' tokens, thereby potentially constraining the exploration of the system.
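
The two ceilings quoted above come straight from $\pi_{\theta_{\text{old}}} \cdot (1 + \varepsilon)$; a quick, purely illustrative check:

```python
eps = 0.2
for p_old in (0.01, 0.9):
    ceiling = p_old * (1 + eps)   # highest probability reachable before the ratio is clipped
    print(f"p_old = {p_old:>4}: ceiling = {ceiling:.3f}")
# p_old = 0.01: ceiling = 0.012  -> almost no room for a rare 'exploration' token
# p_old =  0.9: ceiling = 1.080  -> effectively unconstrained (probability saturates at 1.0 anyway)
```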


Adhering to the Clip-Higher strategy, we decouple the lower and higher clipping ranges as $\varepsilon_{\text{low}}$ and $\varepsilon_{\text{high}}$, as highlighted in Equation 10 (reproduced below):

$$
\begin{aligned}
\mathcal{J}_{\text{DAPO}}(\theta) = \ & \mathbb{E}_{(q,a) \sim \mathcal{D},\, \{o_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \\
& \left[\frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \min\left(r_{i,t}(\theta)\, \hat{A}_{i,t},\ \text{clip}\left(r_{i,t}(\theta),\, 1 - \varepsilon_{\text{low}},\, 1 + \varepsilon_{\text{high}}\right) \hat{A}_{i,t}\right)\right] \\
\text{s.t.} \quad & 0 < \left|\{o_i \mid \text{is\_equivalent}(a, o_i)\}\right| < G.
\end{aligned}
$$
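
Here $r_{i,t}(\theta) = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) / \pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})$ is the token-level importance sampling ratio. A possible PyTorch rendering of this objective might look like the sketch below; it ignores the dynamic-sampling constraint in the s.t. clause (assumed to be enforced when batches are built), and the tensor shapes and default values are my own illustrative choices, not the authors' released code:

```python
import torch

def dapo_clip_higher_loss(logp_new: torch.Tensor,    # [G, T] token log-probs under pi_theta
                          logp_old: torch.Tensor,    # [G, T] token log-probs under pi_theta_old
                          advantages: torch.Tensor,  # [G, T] advantage estimates, broadcast per response
                          mask: torch.Tensor,        # [G, T] 1 for real tokens, 0 for padding
                          eps_low: float = 0.2,
                          eps_high: float = 0.28) -> torch.Tensor:
    """Token-level clipped surrogate with decoupled clipping ranges.

    Returns a loss to minimize (the negative of the objective above).
    eps_high > eps_low leaves more headroom for low-probability tokens;
    the default values here are illustrative.
    """
    ratio = torch.exp(logp_new - logp_old)                                  # r_{i,t}(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    per_token = torch.min(unclipped, clipped)
    # Token-level averaging: normalize by the total token count of the group
    # (1 / sum_i |o_i|), not per response.
    objective = (per_token * mask).sum() / mask.sum().clamp(min=1)
    return -objective
```

Compared with the vanilla surrogate, the only changes are the asymmetric bounds $(1 - \varepsilon_{\text{low}},\ 1 + \varepsilon_{\text{high}})$ and the normalization over all tokens of the group rather than per sample.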

We increase the value of $\varepsilon_{\text{high}}$ to leave more room for the increase of low-probability tokens. As shown in Figure 2, this adjustment effectively enhances the policy's entropy and facilitates the generation of more diverse samples. We keep $\varepsilon_{\text{low}}$ as it is, because increasing it will suppress the probability of these tokens to 0, resulting in the collapse of the sampling space.
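
Since the effect of Clip-Higher is read off the policy's entropy curve (Figure 2), one simple way to watch for entropy collapse during training is to log the mean token entropy of the sampled responses. The helper below is an illustrative sketch, not a metric defined in the paper:

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average per-token entropy over non-padding positions.

    logits: [G, T, V] pre-softmax scores at each generated position;
    mask:   [G, T] 1 for real tokens, 0 for padding.
    """
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)          # [G, T] per-token entropy
    return (entropy * mask).sum() / mask.sum().clamp(min=1)
```

A curve that decays quickly toward zero is the entropy-collapse pattern described at the top of the excerpt; raising $\varepsilon_{\text{high}}$ is intended to slow or reverse that decay.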

Author: Geaming

Copyright notice: Unless otherwise stated, all articles on this blog are licensed under the BY-NC-SA license. Please credit the source when reposting!