Reinforcement Learning from Human Feedback (RLHF)

PPO Objective (Clipped Surrogate)
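
In its standard form (assumed here, using the conventional symbols $r_t(\theta)$, $\hat{A}_t$, and $\epsilon$, none of which are confirmed by this document), the clipped surrogate objective can be written as:

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\left[
      \min\left(
        r_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t
      \right)
    \right]
```

This objective is maximized with respect to $\theta$. Taking the minimum of the clipped and unclipped terms makes it a pessimistic bound on the unclipped surrogate, which discourages policy updates that move the probability ratio far from 1.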

where
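
Assuming the conventional PPO notation, the terms of the clipped surrogate are usually defined as follows:

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\qquad
\operatorname{clip}(x, a, b) = \min\left(\max(x, a),\, b\right)
```

Here $\pi_{\theta_{\mathrm{old}}}$ is the policy that collected the data, $\hat{A}_t$ is an estimate of the advantage at timestep $t$, and $\epsilon$ is a small clip-range hyperparameter (0.2 is a commonly cited choice, not a value taken from this document).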

Reward Model Objective
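
A common choice in RLHF work, assumed here rather than taken from this document, is the pairwise Bradley-Terry loss over human preference comparisons. In this sketch $r_\phi$ is the reward model with parameters $\phi$, $x$ is a prompt, $y_w$ and $y_l$ are the preferred and rejected responses, $\mathcal{D}$ is the comparison dataset, and $\sigma$ is the logistic sigmoid; all of these names are conventional.

```latex
\mathcal{L}_{\mathrm{RM}}(\phi)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right)
    \right]
```

Minimizing this loss pushes the reward of the preferred response above that of the rejected one. Only the difference between the two rewards matters, so the reward scale is determined only up to an additive constant.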