Reinforcement Learning with Verifiable Rewards (RLVR)

Group Relative Policy Optimization (GRPO) Objective

where advantages are group-normalized:
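
The normalization can be sketched in code. This is a minimal illustration, assuming the commonly used form in which each reward in a group of G sampled completions for the same prompt is centered by the group mean and scaled by the group standard deviation. The function name `group_normalized_advantages`, the `eps` stabilizer, and the choice of population standard deviation are assumptions made for this example, not details taken from the source.

```python
import numpy as np


def group_normalized_advantages(rewards, eps=1e-8):
    """Return one advantage per completion in a single group.

    Assumed form: A_i = (r_i - mean(r)) / (std(r) + eps), where r holds the
    verifiable rewards of all completions sampled for one prompt.
    """
    r = np.asarray(rewards, dtype=np.float64)
    mean = r.mean()
    # Population standard deviation (ddof=0); this choice is an assumption,
    # and some implementations use the sample estimate instead.
    std = r.std()
    # eps (hypothetical stabilizer) avoids division by zero when every
    # completion in the group receives the same reward.
    return (r - mean) / (std + eps)


# Example: four completions for one prompt, scored by a binary verifier.
advantages = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])
uniform = group_normalized_advantages([1.0, 1.0, 1.0, 1.0])
```

Under this form, a group whose completions all receive the same reward yields all-zero advantages, so that group contributes no learning signal.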