Reinforcement Learning with Verifiable Rewards (RLVR)

Group Relative Policy Optimization (GRPO) Objective

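$L(\theta) = \mathbb{E}\left[\frac{1}{G} \sum_{i=1}^{G} \left(\min\left(\frac{\pi_{\theta}(a_{i} \mid s)}{\pi_{\mathrm{old}}(a_{i} \mid s)} A_{i},\ \operatorname{clip}\left(\frac{\pi_{\theta}(a_{i} \mid s)}{\pi_{\mathrm{old}}(a_{i} \mid s)},\ 1 - \epsilon,\ 1 + \epsilon\right) A_{i}\right) - \beta\, D_{\mathrm{KL}}\left(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}\right)\right)\right]$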

where advantages are group-normalized: $A_{i} = \frac{R_{i} - \operatorname{mean}(\{R_{1}, \dots, R_{G}\})}{\operatorname{std}(\{R_{1}, \dots, R_{G}\})}$
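
The objective and the group-normalized advantages above can be sketched in plain Python for a single group of $G$ sampled actions. This is a minimal illustration, not a reference implementation. The function names, the default values `clip_eps=0.2` and `beta=0.04`, the use of the population standard deviation, the small `eps` added to the denominator, and passing the KL term in as a precomputed scalar are all assumptions made for the example.

```python
import math
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-8):
    """A_i = (R_i - mean(R)) / std(R), computed over one group of G rewards.

    Assumptions: population std (pstdev), and a small eps in the
    denominator so a group with identical rewards does not divide by zero.
    """
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio, advantage, clip_eps=0.2):
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one sample."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


def grpo_objective(logp_new, logp_old, rewards, kl, clip_eps=0.2, beta=0.04):
    """Average of (clipped surrogate - beta * KL) over one group.

    logp_new / logp_old: log pi_theta(a_i | s) and log pi_old(a_i | s).
    kl: a scalar estimate of D_KL(pi_theta || pi_ref), supplied by the
        caller (hypothetical interface for this sketch).
    """
    advantages = group_advantages(rewards)
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    terms = [
        clipped_term(r, a, clip_eps) - beta * kl
        for r, a in zip(ratios, advantages)
    ]
    return fmean(terms)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]   # verifiable 0/1 rewards for G = 4 samples
    logp = [-1.2, -0.7, -2.0, -1.5]  # illustrative log-probabilities
    print(group_advantages(rewards))
    print(grpo_objective(logp, logp, rewards, kl=0.5))
```

When the new and old policies agree, every ratio is 1 and the normalized advantages sum to zero, so the objective reduces to the KL penalty alone. In practice the quantity is maximized, or its negation is minimized as a loss.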
