DQN and PPO — A Complete Guide to Deep Reinforcement Learning

Deep Q-Networks (DQN) and Proximal Policy Optimization (PPO) are arguably the two most important algorithms in modern deep reinforcement learning. DQN brought RL into the deep learning era by playing Atari games at superhuman levels. PPO became the go-to algorithm for everything from robotic control to fine-tuning large language models (RLHF).

Yet most explanations either stay too abstract or drown in notation. This guide walks through both algorithms from the ground up — with intuition first, math second, and concrete examples throughout. By the end, you’ll understand why each design choice exists and when to pick one algorithm over the other.

We’ll use a real-world example to ground the concepts: a university timetabling system where an RL agent learns to select repair strategies for a genetic algorithm. The state is a 39-dimensional vector describing the current population, and the agent chooses from 6 discrete actions (different repair configurations). This is a great testbed because the environment is non-stationary — the population evolves every generation — which exposes the fundamental difference between DQN and PPO.

Background: The RL Problem Setup
Deep Q-Network (DQN)
Proximal Policy Optimization (PPO)
DQN vs PPO — When to Use Which

1. Background: The RL Problem Setup

Both DQN and PPO solve the same fundamental problem: given an environment modeled as a Markov Decision Process (MDP), learn a policy that maximizes cumulative reward.

MDP tuple: ⟨S, A, P, R, γ⟩

Symbol	Meaning	Example (Timetabling)
S	State space	39-dimensional population snapshot ∈ [0,1]³⁹
A	Action space	6 discrete repair configurations {0, …, 5}
P	Transition probabilities	Determined by the genetic algorithm + repair step
R	Reward function	Improvement in constraint violations
γ	Discount factor	Typically 0.99

The agent’s goal: Find a policy π(a|s) that maximizes the expected discounted return:

$G_{t} = \sum_{k = 0}^{\infty} γ^{k} r_{t + k}$

Key RL Concepts You Need First

Before diving into DQN and PPO, let’s establish the foundational concepts that both algorithms rely on.

Model-Free Reinforcement Learning

RL algorithms can be model-based (learn a model of the environment and plan with it) or model-free (learn directly from experience without building a model).

Both DQN and PPO are model-free. They don’t try to predict what happens when an action is taken — they just try actions, observe outcomes, and learn from those outcomes. This is crucial for complex environments where building an accurate model would be impractical.

Discount Factor (γ)

The discount factor γ (between 0 and 1) controls how much the agent cares about future rewards versus immediate ones.

γ = 0: Only care about immediate reward. Very short-sighted.
γ = 0.99: Care almost equally about near and distant future rewards. A reward 50 steps away is worth 0.99⁵⁰ ≈ 0.605 of its face value.
γ = 1: Care equally about all future rewards (can cause instability).

Setting γ = 0.99 makes the agent prioritize long-term outcomes — essential when early decisions affect later performance.

Temporal Difference (TD) Learning

Both algorithms use TD learning at their core. Instead of waiting until the end of an episode to learn (Monte Carlo), TD methods update value estimates after every single step:

$V (s_{t}) \leftarrow V (s_{t}) + α [r_{t} + γV (s_{t + 1}) - V (s_{t})]$

The bracketed term is the TD error (δ):

$δ_{t} = r_{t} + γV (s_{t + 1}) - V (s_{t})$

If δ > 0, the outcome was better than expected. If δ < 0, it was worse. DQN uses TD learning to update Q-values; PPO uses TD errors to compute advantages via GAE.

Gradient Descent vs Gradient Ascent

Neural networks learn by adjusting weights to improve an objective. The direction depends on whether we’re minimizing a loss or maximizing a reward.

Gradient descent (minimizing):

$θ \leftarrow θ - α \nabla_{θ} L (θ)$

DQN uses gradient descent. Its loss is the squared TD error — the network adjusts weights to make Q-value predictions more accurate.

Gradient ascent (maximizing):

$θ \leftarrow θ + α \nabla_{θ} J (θ)$

PPO uses gradient ascent. Its objective J(θ) is the expected reward under the current policy. The policy gradient theorem gives us the gradient:

$\nabla_{θ} J (θ) = E_{π_{θ}} [\nabla_{θ} lo g π_{θ} (a_{t} ∣ s_{t}) \cdot \hat{A}_{t}]$

The intuition: for each action taken, check the advantage (was it better or worse than average?). If positive, increase that action’s probability. If negative, decrease it. The magnitude controls how much.

Value-Based vs Policy-Based Methods

This is the fundamental split between DQN and PPO.

Value-based (DQN): Learn a value function Q(s, a) that estimates how good each action is. The policy is derived indirectly — always pick the highest Q-value action.

State s → Q-network → Q(s, a₀)=2.1, Q(s, a₁)=5.3, Q(s, a₂)=3.7, ...
                       → Pick a₁ (highest Q)

Policy-based (PPO): Learn a policy π(a|s) that directly outputs action probabilities.

State s → Policy network → π(a₀|s)=0.05, π(a₁|s)=0.60, π(a₂|s)=0.15, ...
                            → Sample from this distribution

Value-based methods are deterministic (always pick max Q), which limits exploration. Policy-based methods are stochastic, allowing natural exploration and smoother learning.

On-Policy vs Off-Policy Learning

Off-policy (DQN): Can learn from experience collected by a different policy. DQN stores past transitions in a replay buffer and samples from them during training — even if those transitions came from an old, random policy.

On-policy (PPO): Can only learn from experience collected by its current policy. PPO collects fresh transitions, trains on them for a few epochs, then discards them. No replay buffer.

This distinction has huge practical consequences: off-policy is more data-efficient (reuses old experience), but on-policy avoids learning from stale data — crucial in non-stationary environments.

Summary

Concept	DQN	PPO
Model-free?	Yes	Yes
What it learns	Value function Q(s,a)	Policy π(a\|s) + Value V(s)
Value-based or Policy-based?	Value-based	Policy-based (actor-critic)
On-policy or Off-policy?	Off-policy (replay buffer)	On-policy (fresh rollouts)
Exploration method	ε-greedy (forced randomness)	Entropy bonus (natural stochasticity)
Stability mechanism	Target network	Clipped surrogate objective

2. Deep Q-Network (DQN)

Core idea: Learn a value function Q(s, a) that estimates the expected return of taking action a in state s, then always pick the action with the highest Q-value.

Reference: Mnih et al., “Human-level control through deep reinforcement learning” (Nature, 2015)

2.1 Q-Learning Foundation

Q-learning is based on the Bellman optimality equation — the recursive relationship between the value of a state-action pair and the values of successor states:

$Q^{*} (s, a) = E [r + γ max_{a^{'}} Q^{*} (s^{'}, a^{'})]$

This says: “The true value of taking action a in state s equals the immediate reward plus the discounted value of the best action in the next state.”

Tabular Q-learning update:

$Q (s, a) \leftarrow Q (s, a) + α [r + γ max_{a^{'}} Q (s^{'}, a^{'}) - Q (s, a)]$

where:

α = learning rate (step size)
r + γ max Q(s’, a’) = the TD target (what Q should be)
The bracketed term = TD error δ (the surprise signal)

Over many updates, Q(s, a) converges to the true expected return for each state-action pair.

2.2 From Q-Tables to Deep Q-Networks

With continuous state dimensions, a lookup table is impossible (infinite states). DQN uses a neural network Q_θ(s, a) parametrized by weights θ to approximate Q:

$Q_{θ} : R^{d_{s}} \to R^{∣ A ∣}$

The network takes the state vector as input and outputs one Q-value per action.

Example architecture:

Input:   s ∈ ℝ³⁹  (state vector)
           ↓
Hidden:  Linear(39 → 64) → ReLU
           ↓
Hidden:  Linear(64 → 64) → ReLU
           ↓
Output:  Linear(64 → 6)  (Q-values for 6 actions)

Loss function: Mean squared TD error:

$L (θ) = E_{(s, a, r, s^{'}) \sim B} [(r + γ max_{a^{'}} Q_{θ^{-}} (s^{'}, a^{'}) - Q_{θ} (s, a))^{2}]$

where θ⁻ are the target network parameters (explained next).

2.3 Experience Replay

Problem: Consecutive transitions (s₁→s₂→s₃→…) are highly correlated. Training on correlated data causes instability and catastrophic forgetting.

Solution: Store all transitions in a replay buffer of fixed capacity and sample random minibatches:

$D = {(s_{t}, a_{t}, r_{t}, s_{t + 1})}_{t = 1}^{∣ D ∣}$

At each training step, sample a minibatch uniformly at random, compute TD targets, and update via gradient descent.

Why it works:

Breaks temporal correlations — random sampling decorrelates training data
Data efficiency — each transition gets reused many times
Smooth learning — averages over many past behaviors

2.4 Target Network

Problem: In the loss function, both the prediction Qθ(s, a) and the target r + γ max Qθ(s’, a’) depend on the same weights θ. When θ changes, the target moves — chasing a moving goalpost causes oscillation and divergence.

Solution: Maintain a separate target network Q_θ⁻ with frozen weights, updated only periodically:

$y_{t} = r_{t} + γ max_{a^{'}} Q_{θ^{-}} (s_{t + 1}, a^{'})$

Every C steps: θ⁻ ← θ (hard copy)

The target network provides stable learning targets for C steps at a time. The main network chases a target that only moves occasionally — not every gradient step.

2.5 ε-Greedy Exploration

Problem: If the agent always picks argmax Q(s, a), it never tries new actions and might miss better options.

Solution: With probability ε, take a random action; otherwise, take the greedy action:

$a_{t} = {Uniform (A) ar g max_{a} Q_{θ} (s_{t}, a) with probability ε with probability 1 - ε$

ε starts high (1.0 = fully random) and decays to a small value (0.05 = mostly greedy):

$ε_{t} = max (ε_{m i n}, ε_{start} - \frac{t}{T _{decay}} \cdot (ε_{start} - ε_{m i n}))$

2.6 Complete DQN Algorithm

ALGORITHM: Deep Q-Network (DQN)
─────────────────────────────────────────────────────────
Initialize Q-network Q_θ with random weights θ
Initialize target network Q_θ⁻ ← θ
Initialize replay buffer D with capacity N

For episode = 1, 2, ..., M:
    Reset environment → initial state s₀

    For t = 0, 1, ..., T_max - 1:
        // ACTION SELECTION (ε-greedy)
        With probability ε: a_t ← random action
        Otherwise:          a_t ← argmax_a Q_θ(s_t, a)

        // EXECUTE & STORE
        Execute a_t → observe r_t, s_{t+1}
        Store (s_t, a_t, r_t, s_{t+1}) in D

        // LEARN
        Sample random minibatch from D
        Compute targets: yⱼ = rⱼ + γ · max_{a'} Q_θ⁻(s'ⱼ, a')
        Update θ by minimizing: L = mean(yⱼ - Q_θ(sⱼ, aⱼ))²

        // TARGET & EXPLORATION UPDATES
        Every C steps: θ⁻ ← θ
        ε ← max(ε_min, ε - Δε)

DQN: Strengths and Weaknesses

Strengths:

Simple to implement and debug
Data-efficient through experience replay
Well-understood convergence properties

Weaknesses:

Only works with discrete action spaces
Replay buffer stores stale transitions — problematic in non-stationary environments
ε-greedy exploration is crude (random noise, not directed)
Can overestimate Q-values (ameliorated by Double DQN)

3. Proximal Policy Optimization (PPO)

Core idea: Directly optimize the policy π_θ(a|s) using gradient ascent on expected reward, but constrain updates to stay close to the current policy for stability.

Reference: Schulman et al., “Proximal Policy Optimization Algorithms” (2017)

3.1 Policy Gradient Foundation

Instead of learning a value function and deriving the policy, policy gradient methods directly parametrize and optimize the policy.

The policy gradient theorem (Sutton et al., 1999):

$\nabla_{θ} J (θ) = E_{π_{θ}} [\nabla_{θ} lo g π_{θ} (a_{t} ∣ s_{t}) \cdot G_{t}]$

If action a_t led to high return G_t, increase its probability. If it led to low return, decrease it.

Problem with vanilla policy gradient: High variance (G_t can be very noisy), and large gradient steps can destroy the policy catastrophically.

3.2 The Trust Region Problem

A large policy update can ruin performance irreversibly:

Old policy:  π_old(action_A) = 0.4,  π_old(action_B) = 0.3, ...
             ↓ large gradient step ↓
New policy:  π_new(action_A) = 0.01, π_new(action_B) = 0.95, ...

This discards everything the policy learned. We need to constrain how much the policy can change per update.

TRPO solves this with a hard KL-divergence constraint, but requires expensive second-order optimization. PPO achieves the same goal with a simple first-order clipping trick.

3.3 PPO Clipped Surrogate Objective

Step 1 — Importance sampling ratio: How much did the policy change for this action?

$r_{t} (θ) = \frac{π _{θ} ( a _{t} ∣ s _{t} )}{π _{θ_{old}} ( a _{t} ∣ s _{t} )}$

r_t = 1.0: policies agree perfectly
r_t = 1.5: new policy is 50% more likely to take this action
r_t = 0.7: new policy is 30% less likely

Step 2 — Clipped surrogate:

$L^{CLIP} (θ) = E_{t} [min (r_{t} (θ) \hat{A}_{t}, clip (r_{t} (θ), 1 - ε, 1 + ε) \hat{A}_{t})]$

where ε is the clip range (typically 0.2).

How clipping works:

Advantage Â_t	Ratio r_t	What happens	Effect
Â > 0 (good action)	r_t > 1+ε	Probability increasing too fast	Clips at 1+ε → limits increase
Â > 0 (good action)	r_t < 1+ε	Probability increasing moderately	No clip → normal gradient
Â < 0 (bad action)	r_t < 1−ε	Probability decreasing too fast	Clips at 1−ε → limits decrease
Â < 0 (bad action)	r_t > 1−ε	Probability decreasing moderately	No clip → normal gradient

Worked example: The agent took an action that turned out great (Â = +5.0). Old policy probability was 0.30, new policy pushed it to 0.55.

$r_{t} = \frac{0.55}{0.30} = 1.833$

With ε = 0.2, the clip range is [0.8, 1.2]:

Unclipped: 1.833 × 5.0 = 9.17
Clipped: 1.2 × 5.0 = 6.0
min(9.17, 6.0) = 6.0 — the objective is capped, preventing an overly aggressive update.

3.4 Generalized Advantage Estimation (GAE)

The advantage Â_t measures how much better an action was compared to the average:

$\hat{A}_{t} = Q (s_{t}, a_{t}) - V (s_{t})$

We don’t know Q exactly, so GAE provides a smooth bias-variance tradeoff using TD residuals:

$δ_{t} = r_{t} + γV (s_{t + 1}) - V (s_{t})$

GAE combines multiple TD residuals with exponential weighting:

$\hat{A}_{t}^{GAE (γ, λ)} = \sum_{l = 0}^{T - t - 1} (γλ)^{l} δ_{t + l}$

The λ parameter controls the tradeoff:

λ value	Equivalent to	Bias	Variance	Behavior
λ = 0	1-step TD	High	Low	Only immediate TD error
λ = 1	Monte Carlo	Low	High	Full episode return
λ = 0.95	Blends ~20 steps	Moderate	Moderate	Good practical tradeoff

3.5 Actor-Critic Architecture

PPO uses an actor-critic network with a shared feature extractor and two heads:

Input: state vector s
           ↓
    ┌──────────────────────┐
    │   Shared Feature      │
    │   Extractor           │
    │   (MLP with ReLU)     │
    └───────┬───────┬───────┘
            ↓       ↓
    ┌───────┴──┐ ┌──┴───────┐
    │  ACTOR   │ │  CRITIC  │
    │  π(a|s)  │ │  V(s)    │
    │ Softmax  │ │  Scalar  │
    └──────────┘ └──────────┘

The Actor outputs action probabilities. The Critic outputs a state value estimate used to compute advantages. Sharing the feature extractor is efficient — understanding the environment state is useful for both selecting actions and predicting value.

3.6 Action Masking (Maskable PPO)

In many environments, certain actions are invalid in certain states. Maskable PPO handles this natively by setting masked action logits to −∞ before the softmax:

$z_{i}^{masked} = {z_{i} - \infty if action i is valid if action i is invalid$

Since softmax(−∞) = 0, masked actions get exactly zero probability — no post-hoc renormalization needed.

3.7 Total PPO Loss Function

The total loss combines three terms:

$L (θ) = policy improvement - L_{t}^{CLIP} (θ) + value accuracy c_{1} (V_{θ} (s_{t}) - V_{t}^{targ})^{2} - exploration bonus c_{2} \cdot H [π_{θ}] (s_{t})$

Clipped surrogate (policy loss): Improves action selection within the trust region.

Value function loss: Trains the critic to accurately predict state values: $L^{V} = (V_{θ} (s_{t}) - V_{t}^{targ})^{2}$

Entropy bonus: Prevents the policy from becoming deterministic too early: $H [π_{θ}] (s) = - \sum_{a} π_{θ} (a ∣ s) lo g π_{θ} (a ∣ s)$

High entropy = maximum exploration. Low entropy = committed exploitation. The coefficient c₂ controls the balance.

3.8 Complete PPO Algorithm

ALGORITHM: Proximal Policy Optimization (PPO)
─────────────────────────────────────────────────────────
Initialize actor-critic network with random weights θ
Set: ε (clip), γ (discount), λ (GAE), c₁, c₂, K (epochs), N (rollout steps)

For iteration = 1, 2, ...:

    // COLLECT ROLLOUTS (on-policy)
    For t = 0, ..., N-1:
        Sample a_t ~ π_θ(·|s_t)
        Execute a_t → observe r_t, s_{t+1}
        Store (s_t, a_t, r_t, V_θ(s_t), log π_θ(a_t|s_t))

    // COMPUTE ADVANTAGES (GAE, backward pass)
    For t = N-1, ..., 0:
        δ_t = r_t + γ·V(s_{t+1}) - V(s_t)
        Â_t = δ_t + (γλ)·Â_{t+1}
    Returns: R_t = Â_t + V(s_t)

    // OPTIMIZE (K epochs over same rollout)
    For epoch = 1, ..., K:
        For each minibatch:
            r_t(θ) = π_θ(a_t|s_t) / π_old(a_t|s_t)
            L_clip = min(r_t·Â_t, clip(r_t, 1-ε, 1+ε)·Â_t)
            L = -L_clip + c₁·(V_θ(s_t) - R_t)² - c₂·H[π_θ]
            θ ← θ - α·∇_θ L

    // DISCARD OLD DATA (on-policy — no replay buffer)
    θ_old ← θ

Critical difference from DQN: PPO discards all collected data after each update. It always trains on fresh rollouts from the current policy. This avoids the stale-data problem but is less data-efficient.

4. DQN vs PPO — When to Use Which

Algorithmic Comparison

Aspect	DQN	PPO
What it learns	Value function Q(s,a)	Policy π(a\|s) + value V(s)
Policy type	Deterministic + ε-greedy	Stochastic (probability distribution)
Data usage	Off-policy (replay buffer)	On-policy (fresh rollouts)
Stability mechanism	Target network	Clipped surrogate objective
Exploration	ε-greedy (random noise)	Entropy bonus (natural stochasticity)
Action masking	Post-hoc, ad-hoc	Native (MaskedSoftmax)
Data efficiency	Higher (reuses transitions)	Lower (discards after update)
Non-stationary envs	Poor (stale replay data)	Good (always fresh data)
Continuous actions	Not supported	Supported (Gaussian policy)

When DQN is Better

Stationary environments where old experience stays relevant (e.g., Atari games)
Data-scarce settings where you need to squeeze learning from every transition
Simple discrete action spaces with clear optimal actions
When simplicity matters — DQN is easier to implement and debug

When PPO is Better

Non-stationary environments where dynamics shift over time
Continuous or large discrete action spaces
When you need action masking for constraint satisfaction
Multi-task or curriculum learning where reward functions change
When stability is critical — PPO’s clipped updates are more predictable

The Non-Stationarity Problem (Why PPO Often Wins)

This is worth emphasizing because it’s the most common failure mode of DQN in practice:

DQN replay buffer problem:
  Step 10:  store (s, a, r=+50, s')  ← collected when dynamics were X
  Step 100: sample for training       ← dynamics are now Y
                                      ← old transition is misleading!

PPO on-policy solution:
  Step 100: collect fresh rollouts with current policy
            → advantages from current value estimates
            → no stale data contamination

In the timetabling example, the population changes every generation. A transition recorded when violations were at 1,400 is useless when they’re down to 70 — the dynamics are completely different. DQN’s replay buffer keeps learning from these stale transitions, while PPO always uses fresh data matching the current state.

The result: PPO achieved 7.8× higher mean reward and showed a clear upward learning trend, while DQN’s reward curve stayed flat after early convergence.

Key Takeaways

DQN learns what actions are worth (Q-values). PPO learns what to do (policy probabilities). Both are valid; the choice depends on your problem.
Experience replay makes DQN data-efficient but vulnerable to non-stationarity. On-policy collection keeps PPO data-fresh but data-hungry.
PPO’s clipped objective is elegant — it achieves trust-region-like stability with a simple min(..., clip(...)) operation instead of costly second-order optimization.
GAE with λ ≈ 0.95 gives you a practical bias-variance tradeoff for advantage estimation, and it’s used in virtually all modern PPO implementations.
For non-stationary environments (changing dynamics, curriculum learning, evolving populations), PPO’s on-policy approach is strongly preferred over DQN’s replay buffer.

References

Mnih, V. et al. (2015). “Human-level control through deep reinforcement learning.” Nature, 518(7540), 529–533.
Schulman, J. et al. (2017). “Proximal Policy Optimization Algorithms.” arXiv:1707.06347.
Schulman, J. et al. (2015). “High-Dimensional Continuous Control Using Generalized Advantage Estimation.” arXiv:1506.02438.
Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Huang, S. et al. (2020). “A Closer Look at Invalid Action Masking in Policy Gradient Algorithms.” arXiv:2006.14171.
Raffin, A. et al. (2021). “Stable-Baselines3: Reliable Reinforcement Learning Implementations.” JMLR, 22(268), 1–8.

Dinanath Padhya

Explorer