Reinforcement Learning from Human Feedback (RLHF)

PPO Objective (Clipped Surrogate)

$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right)\hat{A}_t\right)\right]$ where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}$
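
As a rough illustration, the clipped surrogate can be evaluated over a small batch in plain Python. This is a minimal sketch, not a training implementation: the function name `ppo_clip_objective`, its argument names, and the default `epsilon=0.2` are assumptions made for this example, and the advantage estimates are taken as given. The probability ratio is computed from log-probabilities, which is a common convention but is not stated in the note.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Batch average of the clipped surrogate objective (to be maximized).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and old policies. advantages: advantage estimates per step.
    All names and the default epsilon are illustrative assumptions.
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # r_t(theta) = pi_theta(a|s) / pi_old(a|s), computed in log space
        ratio = math.exp(lp_new - lp_old)
        # clip(r_t, 1 - epsilon, 1 + epsilon)
        clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Take the more pessimistic of the unclipped and clipped terms
        terms.append(min(ratio * adv, clipped_ratio * adv))
    # The empirical expectation is approximated by the batch mean
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # The ratio is 2.0 in both steps; only the sign of the advantage differs
    logp_new = [math.log(2.0), math.log(2.0)]
    logp_old = [0.0, 0.0]
    print(ppo_clip_objective(logp_new, logp_old, [1.0, -1.0]))
```

With a positive advantage the ratio's contribution is capped at $1 + \epsilon$, while with a negative advantage the unclipped term is kept whenever it is lower, so the objective never rewards moving far from the old policy.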

Reward Model Objective

$L(r_\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$
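
The pairwise loss above can be sketched in a few lines of plain Python, reading $y_w$ as the preferred response and $y_l$ as the rejected one for prompt $x$. The function names `log_sigmoid` and `reward_model_loss` are assumptions for this example, and the scalar rewards are passed in directly rather than produced by an actual model $r_\phi$.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(chosen_rewards, rejected_rewards):
    """Mean of -log sigmoid(r(x, y_w) - r(x, y_l)) over a batch of pairs.

    chosen_rewards[i] and rejected_rewards[i] are the scalar rewards
    assigned to the preferred and rejected responses for the same prompt
    (hypothetical inputs for illustration).
    """
    losses = [
        -log_sigmoid(r_w - r_l)
        for r_w, r_l in zip(chosen_rewards, rejected_rewards)
    ]
    # The expectation over D is approximated by the batch mean
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Equal rewards give log(2); a larger margin for y_w lowers the loss
    print(reward_model_loss([0.0], [0.0]))
    print(reward_model_loss([2.0], [0.0]))
```

The loss depends only on the reward difference within each pair, so adding a constant to every reward leaves it unchanged.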
