Dense process rewards, which provide feedback at each intermediate step rather than only the whole trajectory, have proven effective in inference-time scaling of large language models (LLMs) on challenging reasoning tasks.
This raises a central question: how can we acquire and utilize high-quality dense rewards at scale?
PRIME (Process Reinforcement through IMplicit rEwards):
serves as a general method to fuse token-level dense rewards and sparse outcome rewards by calculating their returns separately before summing them, making it compatible with diverse RL algorithms
eliminates the dedicated reward modeling stage, which is required by existing works, by simply initializing from the SFT model or even the base model.
Key challenges in scalable dense rewards:
Process rewards are hard to define
PRM online updates are not scalable
Explicit reward modeling brings extra cost
$$
r_\phi(y_t) := \beta \log \frac{\pi_\phi(y_t \mid y_{<t})}{\pi_{\text{ref}}(y_t \mid y_{<t})}
$$
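As a concrete illustration, here is a minimal PyTorch sketch of this definition. It assumes the per-token log-probabilities of the sampled response under the Implicit PRM $\pi_\phi$ and the reference model $\pi_{\text{ref}}$ have already been gathered; the function and argument names are ours.

```python
import torch

def implicit_process_rewards(prm_logprobs: torch.Tensor,
                             ref_logprobs: torch.Tensor,
                             beta: float) -> torch.Tensor:
    """r_phi(y_t) = beta * (log pi_phi(y_t | y_<t) - log pi_ref(y_t | y_<t)).

    Both inputs hold the log-probability of each sampled response token,
    shape (batch, seq_len); padding positions should be masked by the caller.
    """
    return beta * (prm_logprobs - ref_logprobs)
```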
In PRIME, once rollouts are generated and graded by the (ground-truth) outcome verifier, we update the Implicit PRM online with these on-policy rollouts and their outcome supervision, and then calculate token-level dense rewards to estimate advantages.
More specifically, we use an Implicit PRM $\pi_\phi$ and an outcome verifier or reward model $r_o$. When both are available, we calculate the returns of implicit process rewards and outcome rewards separately, since directly mixing their values may lead to numerical instability. For implicit process rewards, we perform a three-step process to compute the return: (1) use the averaged implicit process rewards to calculate the leave-one-out baseline; (2) normalize the process reward at step $t$ by subtracting the baseline; (3) calculate the discounted return for each response. For outcome rewards, we directly adopt RLOO without any modification. Finally, the advantage is set to the combination of both returns:
$$
A_t^i = \underbrace{\sum_{s=t}^{|y^i|} \gamma^{s-t} \cdot \left[ r_\phi(y_s^i) - \frac{1}{K-1}\sum_{j \neq i} r_\phi(y^j) \right]}_{\text{RLOO with implicit process rewards}} + \underbrace{r_o(y^i) - \frac{1}{K-1}\sum_{j \neq i} r_o(y^j)}_{\text{RLOO with outcome rewards}}
$$
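A minimal PyTorch sketch of this advantage estimator for the $K$ responses sampled from a single prompt. The names are ours, and the leave-one-out baseline uses each response's mean implicit process reward, following the "averaged implicit process rewards" step above (assumes $K \geq 2$).

```python
import torch
from typing import List

def prime_advantages(process_rewards: List[torch.Tensor],  # K tensors, each (len_i,)
                     outcome_rewards: torch.Tensor,         # shape (K,)
                     gamma: float = 1.0) -> List[torch.Tensor]:
    """Combine RLOO-style returns of implicit process rewards and outcome rewards."""
    K = len(process_rewards)
    # Mean implicit process reward of each response, used for the leave-one-out baseline.
    means = torch.stack([r.mean() for r in process_rewards])  # (K,)
    advantages = []
    for i, r in enumerate(process_rewards):
        # (1)+(2): subtract the leave-one-out baseline built from the other K-1 responses.
        baseline = (means.sum() - means[i]) / (K - 1)
        centered = r - baseline
        # (3): discounted return-to-go of the centered process rewards.
        ret = torch.zeros_like(centered)
        running = 0.0
        for t in reversed(range(len(centered))):
            running = centered[t] + gamma * running
            ret[t] = running
        # RLOO on outcome rewards, broadcast to every token of response i.
        outcome_loo = outcome_rewards[i] - (outcome_rewards.sum() - outcome_rewards[i]) / (K - 1)
        advantages.append(ret + outcome_loo)
    return advantages
```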
Updating the policy with the PPO clipped surrogate loss. We adopt the PPO clipped surrogate objective for more stable policy updates:
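For reference, a standard form of the PPO clipped surrogate loss with importance ratio $\rho_t(\theta)$ and the advantage $A_t$ estimated as above (the exact variant, e.g. how tokens are aggregated, is an implementation detail not specified here):

$$
\mathcal{L}_{\text{clip}}(\theta) = -\,\mathbb{E}_t\!\left[\min\!\Big(\rho_t(\theta)\,A_t,\ \operatorname{clip}\big(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,A_t\Big)\right],
\qquad
\rho_t(\theta) = \frac{\pi_\theta(y_t \mid y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid y_{<t})}
$$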
Online Prompt Filtering. As we sample multiple trajectories for each prompt, we introduce online prompt filtering, which keeps only the prompts whose rollout accuracy falls within a certain range. This (1) preserves only the prompts within a certain medium difficulty range and (2) balances the data distribution for the Implicit PRM's online training.
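A sketch of such a filter, assuming we track each prompt's rollout accuracy from the outcome verifier; the accuracy bounds here are illustrative placeholders, not the values used by PRIME.

```python
from typing import Dict, List

def filter_prompts(accuracy_per_prompt: Dict[str, float],
                   low: float = 0.2, high: float = 0.8) -> List[str]:
    """Keep only prompts whose rollout accuracy lies inside [low, high].

    accuracy_per_prompt maps each prompt to the fraction of its K sampled
    rollouts that the outcome verifier marked correct.
    """
    return [p for p, acc in accuracy_per_prompt.items() if low <= acc <= high]
```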
Rule-based Outcome Verifier. Consistent with recent research that adopts exact match with ground truth as unhackable rewards, we define the rule-based ground truth outcome verifiers (OV) for math and coding as follows:
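The exact definitions are not reproduced here; the sketch below only illustrates the typical shape of such rule-based verifiers (binary answer match for math, test-case pass rate for code). All names and scoring choices are our assumptions, not PRIME's exact verifiers.

```python
from typing import Callable, List

def math_outcome_reward(predicted_answer: str, ground_truth: str) -> float:
    """Binary reward: 1 if the extracted final answer matches the ground truth.

    Real verifiers normalize answers (e.g., strip LaTeX wrappers, compare
    numerically) before matching; plain string comparison is a placeholder.
    """
    return 1.0 if predicted_answer.strip() == ground_truth.strip() else 0.0

def code_outcome_reward(program: str, run_tests: Callable[[str], List[bool]]) -> float:
    """Reward from ground-truth test cases; here, the fraction of tests passed.

    A stricter binary variant (1 only if all tests pass) is an equally common choice.
    """
    results = run_tests(program)
    return sum(results) / len(results) if results else 0.0
```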