Existing verifiers are almost all built for tasks such as math or code, where answers are easy to check automatically. To push reinforcement learning into more domains, a general verifier is clearly needed. To meet this challenge, the authors' key observation is that an LLM's intrinsic probability of generating the correct free-form answer directly indicates its own evaluation of the reasoning reward. They propose using the model's own probability on the ground truth to score whether a reasoning path is correct, yielding a verifier that applies across a wide range of domains, introduces no additional models, and achieves solid results.
Related Work
Self-Reward Optimization Unsupervised reinforcement learning on language models, using the policy model itself as the reward, has recently emerged as an embarrassingly effective approach. The common idea behind self-reward practice is to raise the probabilities of consistent answers, intuitively motivated by the observation that concentrating on the majority brings free improvements. Recent literature shows that entropy minimization, which naively degrades generation diversity, is surprisingly beneficial for reasoning tasks, though restricted to certain model families. However, such practice can be problematic because it restricts exploration. In contrast to self-rewarding methods that remove diversity to exploit existing reasoning ability, our approach builds the reward on the reference answer, yielding strong reasoning performance while maintaining healthy token entropy via the clip-high trick.
Main Content
The basic insight is that an LLM's intrinsic probability of generating a correct answer directly indicates its own evaluation of the reasoning reward.
RLPR introduces two key innovations:
At the reward modeling level, we propose a simple and scalable alternative to explicit rewards from external verifiers: an intrinsic Probability-based Reward (PR), computed as the average decoding probability of the reference-answer tokens.
At the training level, we propose an adaptive curriculum learning mechanism to stabilize training.
Contributions:
We present RLPR, a simple and scalable framework that extends RLVR to general domains without using external verifiers.
We propose a novel probability reward that eliminates the need for external verifiers and achieves better reward quality than naive likelihood as a reward.
We introduce a novel standard deviation filtering strategy that effectively stabilizes training by removing samples with low reward standard deviation.
We conduct comprehensive experiments to demonstrate the effectiveness of the proposed framework on various base models from Qwen, Llama, and Gemma.
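The standard-deviation filtering mentioned in the contributions above can be sketched roughly as follows. This is a minimal sketch, not the paper's implementation: the function name, data layout, and threshold value are all illustrative (the paper describes an adaptive curriculum, whereas this sketch uses a fixed cutoff).

```python
import statistics

def filter_by_reward_std(prompts_with_rewards, min_std=0.05):
    """Keep only prompts whose per-rollout rewards vary enough to carry a
    learning signal; near-constant rewards (all-easy or all-hard prompts)
    yield near-zero advantages and destabilize training.
    `min_std` is an illustrative threshold, not the paper's value."""
    return [
        (prompt, rewards)
        for prompt, rewards in prompts_with_rewards
        if statistics.pstdev(rewards) >= min_std
    ]

batch = [
    ("q1", [0.9, 0.2, 0.8, 0.1]),      # informative: rewards spread out
    ("q2", [0.51, 0.50, 0.52, 0.50]),  # uninformative: nearly constant
]
kept = filter_by_reward_std(batch)
print([p for p, _ in kept])  # ['q1']
```

The design point is that group-normalized algorithms like GRPO divide by the in-group reward standard deviation, so prompts with near-zero spread contribute noise rather than gradient signal.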
Probability Reward
Motivated by the observation that an LLM's intrinsic probability of generating the correct answer directly reflects its internal assessment of reasoning quality, we use the token-wise decoding probabilities of the reference answer as the reward signal. Unlike existing methods that rely on domain-specific verifiers (Cui et al., 2025a; Luo et al., 2025a), which require substantial human heuristics and engineering effort to build, our reward computation involves only the model itself.
where f_seq aggregates the token-wise probabilities into a single reward scalar for response o. While using f_seq = ∏_t p_t (the product of token probabilities, i.e., the sequence likelihood) reflects the overall likelihood of the reference answer, we observe that it introduces high variance and is overly sensitive to minor variations such as synonyms. For example, the token-probability sequences (0.01, 0.7, 0.9) and (0.05, 0.7, 0.9) yield drastically different scores under the product, despite differing only slightly in the first token. To address this, we instead adopt f_seq = (1/|y*|) Σ_t p_t (the mean probability), which produces a more robust reward signal and, in our analysis, shows better correlation with answer quality (see Figure 4). We observe that the probability reward is highly consistent with the quality of the generated answer y: it is high when the predicted answer is semantically similar to the reference answer and low otherwise. Note that the length-normalization step is redundant for GRPO (Shao et al., 2024), but may be critical for algorithms that do not include group normalization, such as REINFORCE++ (Hu et al., 2025a).
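The product-vs-mean aggregation contrast above can be reproduced in a few lines. A minimal sketch with the toy probability sequences from the example (function names are illustrative, not the paper's notation):

```python
import math

def pr_product(probs):
    """Sequence likelihood: product of token probabilities (high variance)."""
    return math.prod(probs)

def pr_mean(probs):
    """RLPR-style reward: mean token probability (more robust)."""
    return sum(probs) / len(probs)

# A small change in a single token's probability...
a = [0.01, 0.7, 0.9]
b = [0.05, 0.7, 0.9]

print(pr_product(a), pr_product(b))  # ...changes the product by 5x
print(pr_mean(a), pr_mean(b))        # ...but barely moves the mean
```

Under the product the two sequences differ by a factor of five, while under the mean they differ by about 0.013, which is why the mean correlates better with answer quality.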
Reward Debiasing
Although probability-based rewards correlate strongly with response quality, they are also affected by various latent factors. We denote the contributors to the probability reward r as Ur, which can essentially be decomposed into two latent factors:
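One common shape for such a debiasing step is to subtract a baseline score that captures the answer's prior plausibility. The sketch below assumes the baseline is the probability reward computed from the prompt alone, without the model's generated reasoning; both argument names and the clipping range are illustrative, not the paper's exact formulation.

```python
def debiased_reward(r_with_reasoning, r_prompt_only):
    """Subtract a prompt-only baseline from the probability reward and clip
    to [0, 1], so the reward reflects the contribution of the reasoning
    itself rather than the answer's standalone likelihood.
    (Illustrative form, not the paper's exact equation.)"""
    return max(0.0, min(1.0, r_with_reasoning - r_prompt_only))

print(debiased_reward(0.8, 0.3))  # 0.5: reasoning raised the answer's probability
print(debiased_reward(0.2, 0.5))  # 0.0: reasoning did not help; clipped at zero
```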
General domains: MMLU-Pro, GPQA, TheoremQA, WebInstruct
PR discriminates correct responses better than the rule-based verifier on general data. To evaluate how well different rewards distinguish correct from incorrect responses (i.e., assign higher rewards to correct ones), the authors compute ROC-AUC over each prompt's responses, scoring each reward against the human-annotated ground truth.
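The ROC-AUC comparison above reduces to a simple ranking statistic: the probability that a randomly chosen correct response receives a higher reward than a randomly chosen incorrect one. A minimal stdlib-only sketch with hypothetical reward values (the real evaluation would use a library routine and the paper's data):

```python
def roc_auc(scores, labels):
    """ROC-AUC as the fraction of (correct, incorrect) pairs in which the
    correct response scores higher; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical probability rewards for four responses to one prompt:
rewards = [0.9, 0.2, 0.7, 0.1]
correct = [1, 0, 1, 0]  # 1 = matches the human-annotated ground truth
print(roc_auc(rewards, correct))  # 1.0: every correct response outranks every incorrect one
```

An AUC of 1.0 means the reward perfectly separates correct from incorrect responses; 0.5 is no better than chance.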
Ablation Study
Innovations and distinctive contributions of the paper:
Innovation 1: A verifier-free reward framework
The paper proposes the RLPR (Reinforcement Learning with Reference Probability Reward) framework, which uses the LLM's intrinsic probability as the reward signal in place of traditional domain-verifier-based reward mechanisms. This removes the dependence on external verifiers, significantly reducing complexity and improving scalability.