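Reinforcement Learning with Verifiable Rewards (RLVR)

Group Relative Policy Optimization (GRPO) Objective

$$
L(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\left( \frac{\pi_\theta(a_i \mid s)}{\pi_{\text{old}}(a_i \mid s)} A_i,\ \operatorname{clip}\left( \frac{\pi_\theta(a_i \mid s)}{\pi_{\text{old}}(a_i \mid s)},\, 1-\epsilon,\, 1+\epsilon \right) A_i \right) - \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right) \right]
$$

where advantages are group-normalized:

$$
A_i = \frac{R_i - \operatorname{mean}(\{R_1, \dots, R_G\})}{\operatorname{std}(\{R_1, \dots, R_G\})}
$$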
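To make the objective concrete, here is a minimal NumPy sketch that evaluates it for a single group of G sampled responses. Several details are assumptions not stated in the text: the function names `group_normalized_advantages` and `grpo_objective` are hypothetical, the defaults `clip_eps=0.2` and `beta=0.04` are illustrative only, the small `eps` in the denominator is added to avoid division by zero when all rewards in a group are equal, the standard deviation is the population form (`ddof=0`), and the KL term is taken as a precomputed estimate passed in by the caller, since the text does not say how it is estimated.

```python
import numpy as np


def group_normalized_advantages(rewards, eps=1e-8):
    """A_i = (R_i - mean(R)) / std(R), computed over one group of G rewards.

    eps is an assumption of this sketch (not in the formula); it guards
    against a zero standard deviation when all rewards are identical.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)  # np.std defaults to ddof=0


def grpo_objective(logp_new, logp_old, rewards, kl=0.0, clip_eps=0.2, beta=0.04):
    """Evaluate the clipped GRPO objective for one group (to be maximized).

    logp_new, logp_old: log pi_theta(a_i|s) and log pi_old(a_i|s), shape (G,).
    kl: scalar or shape-(G,) estimate of D_KL(pi_theta || pi_ref), supplied
        by the caller (how it is estimated is outside this sketch).
    """
    logp_new = np.asarray(logp_new, dtype=np.float64)
    logp_old = np.asarray(logp_old, dtype=np.float64)

    # Probability ratio pi_theta / pi_old, computed in log space for stability.
    ratio = np.exp(logp_new - logp_old)
    adv = group_normalized_advantages(rewards)

    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv

    # Element-wise min of the two surrogates, minus the KL penalty,
    # then averaged over the group: (1/G) * sum_i (...).
    per_sample = np.minimum(unclipped, clipped) - beta * np.asarray(kl, dtype=np.float64)
    return float(np.mean(per_sample))


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]       # e.g. verifier pass/fail per response
    logp_old = np.zeros(4)
    logp_new = np.full(4, np.log(2.0))   # every ratio equals 2, so clipping is active
    print(group_normalized_advantages(rewards))
    print(grpo_objective(logp_new, logp_old, rewards))
```

With a ratio of 2 and `clip_eps=0.2`, positive-advantage samples are capped at 1.2 times their advantage, while negative-advantage samples keep the unclipped value, because the `min` always selects the more pessimistic of the two terms.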