Source: https://www.youtube.com/watch?v=dGoxEpoacy0
Author: Adam Lucek
Related: Videos

Summary

I Trained an LLM to Think Deeper (Here’s How)

Key Insights & Takeaways

Reinforcement learning (RL) is the core of deeper LLM reasoning – the speaker demonstrates that a well‑structured policy‑gradient approach can produce human‑like, multi‑step reasoning when applied to a large language model.
PPO as the starting point – the baseline policy‑gradient algorithm is Proximal Policy Optimization (PPO). For each query, a reward model scores the policy's output, a reference model supplies a KL penalty, and a value model provides the baseline used to compute the advantage (GAE) that updates the policy.
GRPO – Group Relative Policy Optimization – a modification of PPO from the DeepSeek team that drops the separate value model. The policy samples several outputs per query, and each output's advantage is its reward relative to the group average, so optimization targets relative ranking rather than absolute values.
Training pipeline – the video applies the GRPO recipe used for DeepSeek‑R1 to a smaller base model. It consists of:
1. Model loading – load the base model (Qwen‑2.5‑3B‑Instruct).
2. Dataset preparation – curate question‑answer pairs whose final answers can be checked against ground truth.
3. Reward specification – score each output with reward functions tied to verifiable outcomes, such as a correct final answer.
4. GRPO trainer – run the GRPO loop: sample a group of outputs per prompt, compute group‑relative advantages, and update the policy.
Practical performance – the fine‑tuned model shows a marked improvement in “think‑deeper” tasks, achieving higher accuracy on benchmarks like GSM‑8K and the new DeepSeek‑R1 tasks.
Implementation resources – code (GitHub), the trained model (Hugging Face), and research papers (DeepSeek‑R1, DeepSeek‑Math). These provide a ready‑to‑run starter kit for anyone wanting to replicate or extend the results.
Takeaway for practitioners – if you want a smarter LLM, spend time on RL‑based fine‑tuning rather than just supervised training. Use PPO/GRPO with a carefully crafted reward model to shape the policy.
Future directions – explore sparse reward shaping, curriculum learning, or combining RL with retrieval‑augmented generation for even deeper reasoning.

Short Recap

Stage	Action	Why it Matters
1. Load base LLM	Start with a solid foundation (Qwen‑2.5‑3B‑Instruct)	Provides the large parameter space for nuanced reasoning
2. Build dataset	Curate high‑quality chain‑of‑thought examples	Supplies the signal for RL to learn realistic reasoning paths
3. PPO baseline	Train the policy with a reward signal, KL penalty, and value model	Establishes the standard policy‑gradient setup that GRPO modifies
4. Switch to GRPO	Sample several outputs per query and score each against the group average	Removes the separate value model and optimizes relative ranking
5. Evaluate on benchmarks	Measure depth of reasoning	Confirms that RL adds measurable reasoning improvements

Bottom line: Reinforcement learning, when carefully orchestrated via PPO or its group‑relative variant (GRPO), can make a large language model reason more deeply and accurately. The video provides both conceptual explanations and actionable code.

Architecture

PPO

graph TD
    Q["Query (q)"] --> PM["Policy Model"]
    PM --> O["Output (o)"]
    Q --> RefM["Reference Model"]
    O --> RefM
    O --> RewM["Reward Model"]
    RefM --> KL["KL Penalty"]
    RewM --> R["Reward (r)"]
    KL --> R
    Q --> VM["Value Model"]
    VM --> V["Value (v)"]
    R --> GAE["GAE Advantage (A)"]
    V --> GAE
    GAE -.->|feedback| PM

GRPO

graph TD
    Q["Query (q)"] --> PM["Policy Model"]
    PM --> O1["Output (o_1)"]
    PM --> O2["Output (o_2)"]
    PM --> On["Output (o_n)"]
    Q --> RefM["Reference Model"]
    O1 --> RefM
    O2 --> RefM
    On --> RefM
    O1 --> RewM["Reward Model"]
    O2 --> RewM
    On --> RewM
    RefM --> KL["KL Penalty"]
    RewM --> R["Rewards (r_1...r_n)"]
    KL --> R
    R --> Avg["Average Reward"]
    R --> Adv["Advantage (A)"]
    Avg --> Adv
    Adv -.->|ranked relative optimization| PM

Notes

PPO
- high level review
- starting with a Policy Model
  - policy in this case indicates the behavior of the model
- q -> o
- Reference Model - Reward Model -> KL penalty -> r
- Value Model -> v
- GAE -> A -> feedback
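
The PPO flow above (reward and KL penalty combined into r, value model giving v, GAE producing A) can be sketched in a few lines. This is a minimal illustration, not the video's code: the function names, the per-token KL form, and the defaults for `beta`, `gamma`, and `lam` are assumptions chosen for the example.

```python
from typing import List


def kl_penalized_rewards(score: float, logp_policy: List[float],
                         logp_ref: List[float], beta: float = 0.1) -> List[float]:
    """Per-token rewards: a KL-style penalty on every token, plus the
    reward model's score added on the final token (a common RLHF setup)."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += score
    return rewards


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation over one trajectory.

    rewards[t] is the reward at step t and values[t] is the value model's
    estimate at step t. The value after the final step is taken as 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

The value model is what makes per-step advantages possible here, which is also why PPO has to train and store a second network alongside the policy.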
GRPO
- DeepSeek team modified PPO in various ways
- inefficient to maintain a whole other value model
- Value model helps when you care about intermediate steps
  - if you can compare final answer to ground truth you can do without
- q -> multiple outputs o_n
- Reference and Reward models
- average reward
- advantage calculated based on the difference of each reward to the average
- the rewards are effectively ranked within the group, so you get optimization for relative ranking instead of absolute values
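
The group-relative advantage described in these notes can be sketched as follows. The mean subtraction comes straight from the notes; dividing by the group's standard deviation follows the formulation in the DeepSeekMath paper and is optional here. The function name and the `eps` guard are assumptions for the example.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], normalize: bool = True,
                              eps: float = 1e-8) -> List[float]:
    """Advantage of each sampled output relative to its group.

    rewards holds one scalar reward per output o_1..o_n for the same query.
    No value model is involved: the group average acts as the baseline.
    """
    avg = mean(rewards)
    centered = [r - avg for r in rewards]
    if not normalize:
        return centered
    std = pstdev(rewards)
    # eps avoids division by zero when every output got the same reward.
    return [c / (std + eps) for c in centered]
```

When every output in the group receives the same reward, all advantages are zero and that query contributes no learning signal, so a useful group needs a mix of better and worse outputs.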

How GRPO was applied

the algorithm taught the LLMs to think through problems longer without explicitly defining any functions that encourage reasoning or thinking length

DeepSeek began by training DeepSeek-R1-Zero with pure GRPO RL

base model DeepSeek-V3 LLM
prompt template

A conversation between User and Assistant. The user asks a question and
the Assistant solves it. The assistant first thinks about the reasoning
process in the mind and then provides the user with the answer.
The reasoning process and answer are enclosed within
<think></think> and <answer></answer> respectively. User: _prompt_ Assistant:

math, code, logic based questions
verifiable outcomes
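
A rough sketch of how the prompt template and verifiable outcomes fit together: fill the template, parse the `<think>`/`<answer>` sections, and score the output with rule-based rewards. The DeepSeek-R1 paper describes format and accuracy rewards, but the function names, the exact-match comparison, and the equal weighting below are assumptions for illustration only.

```python
import re
from typing import Optional, Tuple

# Template text from the notes, with `_prompt_` swapped for a format field.
TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question and "
    "the Assistant solves it. The assistant first thinks about the reasoning "
    "process in the mind and then provides the user with the answer. "
    "The reasoning process and answer are enclosed within "
    "<think></think> and <answer></answer> respectively. "
    "User: {prompt} Assistant:"
)

# A completion must be exactly one think block followed by one answer block.
_COMPLETION = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def build_prompt(question: str) -> str:
    """Insert the user's question into the template."""
    return TEMPLATE.format(prompt=question)


def split_completion(completion: str) -> Optional[Tuple[str, str]]:
    """Return (reasoning, answer), or None if the format is not followed."""
    match = _COMPLETION.match(completion)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the think/answer format, else 0.0."""
    return 1.0 if split_completion(completion) is not None else 0.0


def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the extracted answer exactly matches the ground truth (assumed check)."""
    parts = split_completion(completion)
    if parts is None:
        return 0.0
    return 1.0 if parts[1] == ground_truth.strip() else 0.0


def total_reward(completion: str, ground_truth: str) -> float:
    """Hypothetical equal weighting of the two rewards."""
    return format_reward(completion) + accuracy_reward(completion, ground_truth)
```

Neither reward mentions reasoning length, which matches the observation that longer thinking emerged without being rewarded directly.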

results
- exhibit behavior of reflection, reevaluating prior steps
- explore possible alternative methods
- use more test-time compute and generate longer answers as the model reflects and explores more possibilities

full pipeline for DeepSeek-R1
1. SFT with Long CoT examples
- labeled reasoning from DeepSeek-R1-Zero
- Few Shot synthetic data generation
- this develops the style, not the reasoning yet
2. GRPO RL for Reasoning
- coding, math, science, logic examples for verifiability
- generate 800k examples, including non-reasoning data (writing, factual, translation, fuzzy examples)
3. SFT for 2 epochs with these examples
4. RL alignment

with this they matched o1 with much lower compute requirements
