Policy Gradient Theorem

Related: Reinforcement Learning, RL, PPO

The Core Problem

In Reinforcement Learning, we want to maximize the expected total reward, denoted as $J (θ)$ . The objective function is defined as: $J (θ) = E_{τ \sim π_{θ}} [R (τ)]$ where $τ$ is a trajectory (a sequence of states and actions), $π_{θ}$ is our policy parameterized by $θ$ , and $R (τ)$ is the total reward of that trajectory.

To maximize this, we need the gradient of the objective with respect to our parameters: $\nabla_{θ} J (θ)$ .

The Dilemma

The expected return depends on two things:

Our policy $π_{θ} (a ∣ s)$ (which we control).
The environment’s transition dynamics $P (s^{'} ∣ s, a)$ (which we do not know and cannot differentiate).

If we naively try to take the derivative of the expectation, we run into the problem that the distribution of trajectories itself depends on $θ$ .

The Theorem

The Policy Gradient Theorem proves that we can compute the gradient of the expected return without knowing the derivative of the state distribution (the environment dynamics).

The gradient elegantly simplifies to: $\nabla_{θ} J (θ) = E_{τ \sim π_{θ}} [\sum_{t = 0}^{T} \nabla_{θ} lo g π_{θ} (a_{t} ∣ s_{t}) R (τ)]$

The Derivation (Log-Derivative Trick)

The proof relies on a simple calculus trick.

Expand the expectation into a sum over all trajectories: $\nabla_{θ} J (θ) = \nabla_{θ} \sum_{τ} P (τ ∣ θ) R (τ) = \sum_{τ} \nabla_{θ} P (τ ∣ θ) R (τ)$
Apply the log-derivative trick: Recall that $\nabla_{x} lo g f (x) = \frac{\nabla _{x} f ( x )}{f ( x )}$ , which means $\nabla_{x} f (x) = f (x) \nabla_{x} lo g f (x)$ . Substitute this into our equation: $\sum_{τ} P (τ ∣ θ) \nabla_{θ} lo g P (τ ∣ θ) R (τ) = E_{τ \sim π_{θ}} [\nabla_{θ} lo g P (τ ∣ θ) R (τ)]$
Decompose the trajectory probability: The probability of a trajectory is the product of initial state probability, policy probabilities, and environment transition probabilities: $P (τ ∣ θ) = P (s_{0}) \prod_{t = 0}^{T} π_{θ} (a_{t} ∣ s_{t}) P (s_{t + 1} ∣ s_{t}, a_{t})$
Take the log: The product becomes a sum: $lo g P (τ ∣ θ) = lo g P (s_{0}) + \sum_{t = 0}^{T} lo g π_{θ} (a_{t} ∣ s_{t}) + \sum_{t = 0}^{T} lo g P (s_{t + 1} ∣ s_{t}, a_{t})$
Take the gradient: Because the initial state and environment dynamics do not depend on $θ$ , their derivatives are zero. This is the crucial step. They vanish entirely: $\nabla_{θ} lo g P (τ ∣ θ) = \sum_{t = 0}^{T} \nabla_{θ} lo g π_{θ} (a_{t} ∣ s_{t})$

Why this is brilliant

We completely bypassed the environment dynamics.
We only need to differentiate our own policy network.
This formula mathematically justifies trial-and-error: we run trajectories in the environment, collect rewards $R (τ)$ , and push the log-probabilities of the actions we took up or down proportional to the reward received.

Denial

Explorer

Policy Gradient Theorem

The Core Problem

The Dilemma

The Theorem

The Derivation (Log-Derivative Trick)

Why this is brilliant

Graph View

Table of Contents

Backlinks