Reinforcement Learning from Human Feedback (RLHF)
PPO Objective (Clipped Surrogate)
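$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$
where $$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{old}}(a_t \mid s_t)}$$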
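As a minimal sketch of how the clipped surrogate might be evaluated on a batch of timesteps, the function below takes per-step log-probabilities under the current and old policies plus advantage estimates. The function name `ppo_clip_objective`, its argument names, and the default `eps=0.2` are assumptions made for illustration and do not come from the formulas above. It returns the quantity to be maximized; a gradient-descent training loop would typically minimize its negative.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Batch mean of the clipped surrogate L^CLIP (a value to maximize).

    logp_new, logp_old: log pi_theta(a_t|s_t) and log pi_old(a_t|s_t) per step.
    advantages: advantage estimates A_hat_t per step.
    eps: clip range epsilon (0.2 is an assumed default, not from the source).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio r_t(theta), computed from log-probabilities.
        ratio = math.exp(lp_new - lp_old)
        # clip(r_t, 1 - eps, 1 + eps)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        # Pessimistic (lower) bound of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a ratio of 1.5 and a positive advantage, the clipped term caps the contribution at (1 + eps) times the advantage; with a ratio of 0.5 and a negative advantage, the min selects the clipped term (1 - eps) times the advantage, so the objective stops rewarding further movement of the ratio in either case.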
Reward Model Objective
$$L(r_\phi) = -\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$
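The pairwise loss above can be sketched in plain Python as follows, assuming the reward model has already produced scalar scores for the preferred response y_w and the rejected response y_l of each pair. The names `log_sigmoid` and `reward_model_loss` are hypothetical helpers introduced here for illustration; the piecewise form of `log_sigmoid` is a common numerical-stability trick, not something the formula itself prescribes.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reward_model_loss(r_chosen, r_rejected):
    """Batch mean of -log sigma(r(x, y_w) - r(x, y_l)).

    r_chosen: scores r_phi(x, y_w) for the preferred responses.
    r_rejected: scores r_phi(x, y_l) for the rejected responses.
    """
    margins = [rw - rl for rw, rl in zip(r_chosen, r_rejected)]
    return -sum(log_sigmoid(m) for m in margins) / len(margins)
```

When both responses receive the same score the loss equals log 2, and it falls toward zero as the preferred response's score exceeds the rejected one by a growing margin.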