| Resource | Info |
| --- | --- |
| Paper | http://arxiv.org/abs/2501.12948 |
| Code & Data | / |
| Public | arXiv |
| Date | 2025.02.28 |
This paper introduces DeepSeek-R1 and DeepSeek-R1-Zero. DeepSeek-R1-Zero is trained via large-scale reinforcement learning (RL) applied directly to the base model, without supervised fine-tuning (SFT) as a preliminary step, and it demonstrates remarkable reasoning capabilities.
OpenAI's o1 (OpenAI, 2024b) series models were the first to introduce inference-time scaling by increasing the length of the chain-of-thought reasoning process. The authors explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process. The results also show that the reasoning patterns discovered by larger base models are crucial for improving reasoning capability.
The authors adopt Group Relative Policy Optimization (GRPO), which dispenses with a critic model and instead estimates the baseline from a group of outputs sampled for the same question.
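Concretely, each sampled output's reward is normalized against its own group, $A_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\})}$, per the paper. Below is a minimal sketch of that advantage computation; `grpo_advantages` is an illustrative helper, not DeepSeek's code.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantage: normalize each output's reward by the
    mean and std of its own group, so no learned critic is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 completions sampled for one question, scored by a rule-based
# reward (1.0 = final answer correct, 0.0 otherwise).
rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # ~[ 1., -1., -1.,  1.]
```

Outputs scoring above the group mean get positive advantages and are reinforced; the group itself plays the baseline role a critic would otherwise provide.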
Two types of rewards are used to train DeepSeek-R1-Zero: accuracy rewards, which evaluate whether the response is correct (e.g., checking a math answer against the ground truth), and format rewards, which require the model to place its thinking process between `<think>` and `</think>` tags.
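As an illustration, a rule-based format reward could be implemented roughly as follows; the `<think>`/`</think>` structure is specified in the paper, but the exact checking logic here is an assumption.

```python
import re

# Grant the format reward only when the completion wraps its reasoning in
# <think>...</think> tags and then produces some final answer text.
THINK_PATTERN = re.compile(r"^<think>.+?</think>\s*\S", re.DOTALL)

def format_reward(completion: str) -> float:
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0

assert format_reward("<think>3x + 5 = 20, so x = 5</think> Answer: 5") == 1.0
assert format_reward("Answer: 5") == 0.0
```

Because both reward types are rule-based, no neural reward model is needed, which avoids reward hacking during large-scale RL, a point the paper makes explicitly.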
Aha Moment of DeepSeek-R1-Zero
It highlights the power and beauty of reinforcement learning: rather than explicitly teaching the model how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The "aha moment" is a powerful reminder of RL's potential to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.
First, distilling a more powerful model into smaller ones yields excellent results, whereas a smaller model relying on the large-scale RL described in this paper would require enormous compute and might still not reach the performance of distillation. Second, although the distillation strategy is both economical and effective, advancing beyond the current boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning.
General Capability: Currently, the capabilities of DeepSeek-R1 fall short of DeepSeek-V3 in tasks such as function calling, multi-turn, complex role-playing, and JSON output. Moving forward, we plan to explore how long CoT can be leveraged to enhance tasks in these fields.
Language Mixing: DeepSeek-R1 is currently optimized for Chinese and English, which may result in language mixing issues when handling queries in other languages. For instance, DeepSeek-R1 might use English for reasoning and responses, even if the query is in a language other than English or Chinese. We aim to address this limitation in future updates.
Prompt Engineering: When evaluating DeepSeek-R1, we observe that it is sensitive to prompts. Few-shot prompting consistently degrades its performance. Therefore, we recommend users directly describe the problem and specify the output format using a zero-shot setting for optimal results (see the sketch after these limitation notes).
Software Engineering Tasks: Due to the long evaluation times, which impact the efficiency of the RL process, large-scale RL has not been applied extensively in software engineering tasks. As a result, DeepSeek-R1 has not demonstrated a huge improvement over DeepSeek-V3 on software engineering benchmarks. Future versions will address this by implementing rejection sampling on software engineering data or incorporating asynchronous evaluations during the RL process to improve efficiency.
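Following the prompt-engineering note above, a zero-shot query might look like the sketch below. The endpoint URL and model name are assumptions about an OpenAI-compatible DeepSeek deployment, not something the paper specifies; adjust them for your setup.

```python
from openai import OpenAI

# Assumed OpenAI-compatible endpoint and model id; replace with your own.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

# Zero-shot: describe the problem directly and specify the output format,
# with no few-shot exemplars (which the paper found to degrade performance).
prompt = (
    "Solve the equation 3x + 5 = 20. "
    "Put your final answer on the last line in the form 'Answer: <value>'."
)

resp = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```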
ChatGPT
Innovations and unique aspects of the paper:
Problems in the paper and suggestions for improvement:
Proposed innovations or research directions based on the paper's content and findings:
Research plans for the new research directions:
Research Plan 1: Combining unsupervised data generation with reinforcement learning
Research Plan 2: A multilingual-consistency reasoning reward mechanism
Research Plan 3: Cross-domain transfer of reasoning capability