Resource Info
Paper: http://arxiv.org/abs/2505.18454
Code & Data: https://github.com/Yueeeeeeee/HRPO
Public: arXiv
Date: 2025.07.10
The authors propose an RL algorithm (HRPO) tailored to latent reasoning.
We first describe our notation and settings for hybrid latent reasoning. For an input query $x$ and its corresponding token embeddings $\mathbf{e}_{1:n}$, we denote the raw hidden states from the LLM output at step $t$ with $\mathbf{h}_t$, namely:

$$\mathbf{h}_t = \mathcal{M}\big(\mathbf{e}_{1:t}\big),$$

in which $\mathcal{M}$ denotes the transformer model (i.e., decoder layers) and $\mathbf{h}_t$ represents the final-layer hidden states produced by $\mathcal{M}$. With the LM head ($\mathbf{W}$), the next output token $x_{t+1}$ can be sampled from the output distribution over the vocabulary via:

$$x_{t+1} \sim \mathrm{softmax}\big(\mathbf{W}\mathbf{h}_t\big).$$
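Below is a minimal sketch (not the authors' released code) of this standard decoding step using Hugging Face transformers: the decoder yields final-layer hidden states $\mathbf{h}_t$, and the LM head maps them to a distribution over the vocabulary from which the next token is sampled. The model name is only an illustrative choice of a small Qwen checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

inputs = tokenizer("1 + 1 =", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

h_t = out.hidden_states[-1][:, -1, :]        # final-layer hidden state at the last position
logits = model.lm_head(h_t)                  # LM head projection W h_t
probs = torch.softmax(logits, dim=-1)        # output distribution over the vocabulary
next_token = torch.multinomial(probs, 1)     # sample the next output token
print(tokenizer.decode(next_token[0]))
```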
However, hidden states $\mathbf{h}_t$ often lie outside the model's token embedding manifold, which degrades generation quality when they are fed back directly. To avoid this, we project $\mathbf{h}_t$ back into the embedding space so the inputs conform to the model's learned distribution. Specifically, we use the output probabilities to compute a weighted interpolation over the vocabulary:

$$\tilde{\mathbf{e}}_{t+1} = \mathrm{norm}\!\left(\mathbf{E}^{\top}\,\mathrm{softmax}\big(\mathbf{W}\mathbf{h}_t / \tau\big)\right),$$

in which $\tau$ is the temperature and $\mathbf{E}$ denotes the embedding matrix of the LLM. In other words, we compute the next input embedding $\tilde{\mathbf{e}}_{t+1}$ as a weighted sum of all token embeddings, with weights given by $\mathrm{softmax}(\mathbf{W}\mathbf{h}_t / \tau)$. In addition, $\tilde{\mathbf{e}}_{t+1}$ is normalized to preserve the scale and variance of the output vector. This sampling-free mapping ensures differentiability and aligns the projected embedding with the model's native input space, thus leading to improved training dynamics.
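A short sketch of this sampling-free projection, under assumed tensor shapes and names (not the released implementation): the temperature-softened output probabilities weight a sum over the embedding matrix, and the result is rescaled toward the typical norm of token embeddings; the paper only states that the output is normalized, so the exact normalization here is an assumption.

```python
import torch

def interpolate_embedding(h_t, lm_head_weight, embed_matrix, temperature=1.0):
    """h_t: (batch, d_model); lm_head_weight: (vocab, d_model); embed_matrix: (vocab, d_model)."""
    logits = h_t @ lm_head_weight.T                        # (batch, vocab)
    probs = torch.softmax(logits / temperature, dim=-1)    # interpolation weights over the vocabulary
    e_tilde = probs @ embed_matrix                         # weighted sum of all token embeddings
    # Rescale so the interpolated vector keeps the typical scale of token embeddings
    # (one plausible normalization; assumption, not specified in the excerpt above).
    target_norm = embed_matrix.norm(dim=-1).mean()
    e_tilde = e_tilde * (target_norm / (e_tilde.norm(dim=-1, keepdim=True) + 1e-6))
    return e_tilde
```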
While interpolated embeddings preserve semantic continuity, directly feeding $\tilde{\mathbf{e}}_{t+1}$ as the next input removes stochasticity and injects noise from irrelevant tokens, causing degraded generation within RL rollouts. As such, we design a hybrid approach for latent reasoning that gradually imposes hidden state representations onto the sampled token embeddings with a gating mechanism. Drawing on gated recurrence models, we formulate the gating mechanism as:

$$\hat{\mathbf{e}}_{t+1} = \boldsymbol{\alpha} \odot \mathbf{e}_{t+1} + \boldsymbol{\beta} \odot \tilde{\mathbf{e}}_{t+1}, \qquad \boldsymbol{\alpha} = \sigma\big(c\,\Lambda\big), \quad \boldsymbol{\beta} = \sqrt{1 - \boldsymbol{\alpha}^{2}},$$

where $\hat{\mathbf{e}}_{t+1}$ is the resulting hybrid input for the next step, $\mathbf{e}_{t+1}$ denotes the embedding of the sampled discrete token $x_{t+1}$, and $\tilde{\mathbf{e}}_{t+1}$ is the projected hidden state. The gates $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ leverage the sigmoid function to control the blending, $\boldsymbol{\beta}$ scales the hidden-state features, $c$ is a fixed scaling constant, and $\Lambda$ is a learnable vector. Note that hybrid reasoning only applies during the reasoning phase, while the final answer is still generated via standard autoregressive decoding. By initializing $\Lambda$ such that $\boldsymbol{\alpha} \approx \mathbf{1}$, the inputs first draw predominantly from the sampled token embeddings, thereby effectively preserving the LLM's generative capabilities. As training progresses, $\boldsymbol{\alpha}$ converges to an optimal range and the inputs thus incorporate informative features from both hidden representations and sampled tokens.
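A minimal sketch of this gating, following the reconstructed form above (the exact gate parameterization in the released code may differ): a learnable vector `w` passed through a sigmoid with fixed scaling constant `c` blends the sampled token embedding with the projected hidden state, and a large initialization of `w` makes the hybrid input start out dominated by the token embedding.

```python
import torch
import torch.nn as nn

class HybridGate(nn.Module):
    def __init__(self, d_model, c=8.0, init=2.0):
        super().__init__()
        self.c = c                                            # fixed scaling constant
        self.w = nn.Parameter(torch.full((d_model,), init))   # learnable gating vector (Lambda)

    def forward(self, e_token, e_tilde):
        """e_token: sampled-token embedding; e_tilde: projected hidden state; both (batch, d_model)."""
        alpha = torch.sigmoid(self.c * self.w)                # gate in (0, 1), close to 1 at init
        # clamp keeps sqrt differentiable when alpha saturates near 1 (numerical safeguard)
        beta = torch.sqrt(torch.clamp(1.0 - alpha ** 2, min=1e-6))
        return alpha * e_token + beta * e_tilde               # hybrid input for the next step
```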
As such, the HRPO implementation remains lightweight and strictly on-policy, and can be seamlessly combined with further RL optimizations.
Exp 1: open-domain & multi-hop knowledge-intensive question answering (Knowledge)
Exp 2: science, technology, engineering, and mathematics (STEM) benchmarks
Different Strategies for Latent Reasoning. We compare different strategies for computing latent representations. Specifically, we use three methods to integrate hidden states into RL and train the 1.5B Qwen model on the MATH dataset. These variants are: (1) hidden states, which uses the final-layer hidden states as the next input; (2) interpolation, which employs interpolated embeddings; and (3) HRPO, our hybrid latent reasoning. We visualize the exponential moving average (EMA) of rewards along with the GRPO baseline. Due to the mismatch between hidden states and embeddings, using hidden states degrades generation and yields nonsensical rollouts with zero reward. Although interpolation performs similarly to HRPO for the first few hundred steps, the rewards eventually collapse and only slowly recover, likely because interpolation introduces excessive noise. We also provide a direct comparison between HRPO and latent reasoning methods. Overall, our approach achieves superior training dynamics with faster convergence while maintaining stability comparable to GRPO, highlighting the efficacy of the hybrid design in HRPO.
Innovations and unique aspects of the paper:
Problems in the paper and suggestions for improvement:
Proposed innovations or research directions based on the paper's content and findings:
Research plans for the new research directions:
Research direction 1: an RL-based dynamic task adaptation framework
Research direction 2: cross-lingual latent reasoning optimization
Research direction 3: combining hidden states with external knowledge bases