Metadata
- Title: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Authors: DeepSeek-AI
- Publication Date: January 2025
- Source: Local PDF (~/Documents/DeepSeek-R1-v2.pdf)
Summary
The paper demonstrates that advanced reasoning capabilities in Large Language Models (LLMs) can be developed primarily through Reinforcement Learning (RL), minimizing the need for extensive human-annotated reasoning data.
Core Concepts
- Emergent Reasoning via Pure RL: The researchers developed DeepSeek-R1-Zero using pure RL without prior Supervised Fine-Tuning (SFT). By simply rewarding correct final answers, the model naturally developed sophisticated problem-solving behaviors, including self-reflection, verification, and dynamic strategy adaptation.
- Group Relative Policy Optimization (GRPO): The RL algorithm used for training. Unlike PPO, GRPO needs no separate value model; it estimates each response's advantage from the scores of a group of responses sampled for the same prompt, which reduces memory overhead.
- Addressing Readability: DeepSeek-R1-Zero struggled with language mixing and poor readability. To resolve this, the authors created DeepSeek-R1 using a multi-stage pipeline that incorporates limited SFT alongside RL to align the model with human formatting preferences while preserving reasoning strength.
- Distillation: The reasoning patterns learned by the massive DeepSeek-R1 model were distilled into smaller open-weight models (based on Llama and Qwen architectures, ranging from 1.5B to 70B parameters), yielding significant performance gains for smaller architectures without requiring large-scale RL.
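The group-based advantage estimate mentioned in the GRPO item can be illustrated with a minimal sketch. It assumes the commonly cited form in which each reward is normalized by the mean and standard deviation of its group; the function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are illustrative assumptions, not details taken from this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (hypothetical helper).

    rewards: scores for several responses sampled for the SAME prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group gets the same score.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the exact estimator is an assumption
    # No learned value model is involved: the group mean acts as the baseline.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones. When all responses in a group score the same, every advantage is zero, so that group contributes no learning signal in this sketch.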
Training Pipeline for DeepSeek-R1
The final DeepSeek-R1 model is built from its base model through the following stages:
graph TD
    A[DeepSeek-V3 Base] -->|Small set of high-quality CoT data| B(Cold Start SFT)
    B -->|Enhance reasoning & language consistency| C(Reasoning-Oriented RL)
    C -->|600k Reasoning + 200k General Data| D(Rejection Sampling & General SFT)
    D -->|Helpfulness & Harmlessness| E(Alignment RL)
    E --> F[DeepSeek-R1]
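The rejection sampling stage in the pipeline can be pictured as a filter: sample several candidate responses per prompt, keep only those that pass a check, and use the survivors as SFT data. The sketch below is a simplified illustration under stated assumptions. The acceptance rule (an exact match on a final line beginning with `Answer:`) and the names `final_answer` and `rejection_sample` are hypothetical; this summary does not specify the paper's actual filtering criteria.

```python
from typing import List, Optional


def final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Answer:' line (hypothetical format)."""
    for line in reversed(response.strip().splitlines()):
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return None  # no recognizable final answer


def rejection_sample(candidates: List[str], reference: str, max_keep: int = 2) -> List[str]:
    """Keep up to max_keep distinct candidates whose final answer matches reference."""
    kept: List[str] = []
    for response in candidates:
        if len(kept) == max_keep:
            break
        # Reject wrong or unparseable answers, and skip exact duplicates.
        if final_answer(response) == reference and response not in kept:
            kept.append(response)
    return kept


candidates = [
    "2 + 2 is 4.\nAnswer: 4",
    "Just a guess.\nAnswer: 5",
    "No final answer given here.",
    "Add the two numbers.\nAnswer: 4",
]
sft_examples = rejection_sample(candidates, reference="4")
print(len(sft_examples))
```

In this toy run, the two candidates ending in `Answer: 4` are kept and the other two are rejected. A real pipeline would likely apply further quality checks such as readability; those are omitted here.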
Performance
- DeepSeek-R1 achieves performance on par with closed-source frontier reasoning models (such as OpenAI’s o1-1217) across major reasoning benchmarks, including AIME 2024, MATH-500, and Codeforces.