Deep Q-Networks (DQN) and Proximal Policy Optimization (PPO) are arguably the two most important algorithms in modern deep reinforcement learning. DQN brought RL into the deep learning era by playing Atari games at superhuman levels. PPO became the go-to algorithm for everything from robotic control to fine-tuning large language models (RLHF).

Yet most explanations either stay too abstract or drown in notation. This guide walks through both algorithms from the ground up — with intuition first, math second, and concrete examples throughout. By the end, you’ll understand why each design choice exists and when to pick one algorithm over the other.

We’ll use a real-world example to ground the concepts: a university timetabling system where an RL agent learns to select repair strategies for a genetic algorithm. The state is a 39-dimensional vector describing the current population, and the agent chooses from 6 discrete actions (different repair configurations). This is a great testbed because the environment is non-stationary — the population evolves every generation — which exposes the fundamental difference between DQN and PPO.


Table of Contents

  1. Background: The RL Problem Setup
  2. Deep Q-Network (DQN)
  3. Proximal Policy Optimization (PPO)
  4. DQN vs PPO — When to Use Which

1. Background: The RL Problem Setup

Both DQN and PPO solve the same fundamental problem: given an environment modeled as a Markov Decision Process (MDP), learn a policy that maximizes cumulative reward.

MDP tuple: ⟨S, A, P, R, γ⟩

SymbolMeaningExample (Timetabling)
SState space39-dimensional population snapshot ∈ [0,1]³⁹
AAction space6 discrete repair configurations {0, …, 5}
PTransition probabilitiesDetermined by the genetic algorithm + repair step
RReward functionImprovement in constraint violations
γDiscount factorTypically 0.99

The agent’s goal: Find a policy π(a|s) that maximizes the expected discounted return:

Key RL Concepts You Need First

Before diving into DQN and PPO, let’s establish the foundational concepts that both algorithms rely on.

Model-Free Reinforcement Learning

RL algorithms can be model-based (learn a model of the environment and plan with it) or model-free (learn directly from experience without building a model).

Both DQN and PPO are model-free. They don’t try to predict what happens when an action is taken — they just try actions, observe outcomes, and learn from those outcomes. This is crucial for complex environments where building an accurate model would be impractical.

Discount Factor (γ)

The discount factor γ (between 0 and 1) controls how much the agent cares about future rewards versus immediate ones.

  • γ = 0: Only care about immediate reward. Very short-sighted.
  • γ = 0.99: Care almost equally about near and distant future rewards. A reward 50 steps away is worth 0.99⁵⁰ ≈ 0.605 of its face value.
  • γ = 1: Care equally about all future rewards (can cause instability).

Setting γ = 0.99 makes the agent prioritize long-term outcomes — essential when early decisions affect later performance.

Temporal Difference (TD) Learning

Both algorithms use TD learning at their core. Instead of waiting until the end of an episode to learn (Monte Carlo), TD methods update value estimates after every single step:

The bracketed term is the TD error (δ):

If δ > 0, the outcome was better than expected. If δ < 0, it was worse. DQN uses TD learning to update Q-values; PPO uses TD errors to compute advantages via GAE.

Gradient Descent vs Gradient Ascent

Neural networks learn by adjusting weights to improve an objective. The direction depends on whether we’re minimizing a loss or maximizing a reward.

Gradient descent (minimizing):

DQN uses gradient descent. Its loss is the squared TD error — the network adjusts weights to make Q-value predictions more accurate.

Gradient ascent (maximizing):

PPO uses gradient ascent. Its objective J(θ) is the expected reward under the current policy. The policy gradient theorem gives us the gradient:

The intuition: for each action taken, check the advantage (was it better or worse than average?). If positive, increase that action’s probability. If negative, decrease it. The magnitude controls how much.

Value-Based vs Policy-Based Methods

This is the fundamental split between DQN and PPO.

Value-based (DQN): Learn a value function Q(s, a) that estimates how good each action is. The policy is derived indirectly — always pick the highest Q-value action.

State s → Q-network → Q(s, a₀)=2.1, Q(s, a₁)=5.3, Q(s, a₂)=3.7, ...
                       → Pick a₁ (highest Q)

Policy-based (PPO): Learn a policy π(a|s) that directly outputs action probabilities.

State s → Policy network → π(a₀|s)=0.05, π(a₁|s)=0.60, π(a₂|s)=0.15, ...
                            → Sample from this distribution

Value-based methods are deterministic (always pick max Q), which limits exploration. Policy-based methods are stochastic, allowing natural exploration and smoother learning.

On-Policy vs Off-Policy Learning

Off-policy (DQN): Can learn from experience collected by a different policy. DQN stores past transitions in a replay buffer and samples from them during training — even if those transitions came from an old, random policy.

On-policy (PPO): Can only learn from experience collected by its current policy. PPO collects fresh transitions, trains on them for a few epochs, then discards them. No replay buffer.

This distinction has huge practical consequences: off-policy is more data-efficient (reuses old experience), but on-policy avoids learning from stale data — crucial in non-stationary environments.

Summary

ConceptDQNPPO
Model-free?YesYes
What it learnsValue function Q(s,a)Policy π(a|s) + Value V(s)
Value-based or Policy-based?Value-basedPolicy-based (actor-critic)
On-policy or Off-policy?Off-policy (replay buffer)On-policy (fresh rollouts)
Exploration methodε-greedy (forced randomness)Entropy bonus (natural stochasticity)
Stability mechanismTarget networkClipped surrogate objective

2. Deep Q-Network (DQN)

Core idea: Learn a value function Q(s, a) that estimates the expected return of taking action a in state s, then always pick the action with the highest Q-value.

Reference: Mnih et al., “Human-level control through deep reinforcement learning” (Nature, 2015)

2.1 Q-Learning Foundation

Q-learning is based on the Bellman optimality equation — the recursive relationship between the value of a state-action pair and the values of successor states:

This says: “The true value of taking action a in state s equals the immediate reward plus the discounted value of the best action in the next state.”

Tabular Q-learning update:

where:

  • α = learning rate (step size)
  • r + γ max Q(s’, a’) = the TD target (what Q should be)
  • The bracketed term = TD error δ (the surprise signal)

Over many updates, Q(s, a) converges to the true expected return for each state-action pair.

2.2 From Q-Tables to Deep Q-Networks

With continuous state dimensions, a lookup table is impossible (infinite states). DQN uses a neural network Q_θ(s, a) parametrized by weights θ to approximate Q:

The network takes the state vector as input and outputs one Q-value per action.

Example architecture:

Input:   s ∈ ℝ³⁹  (state vector)
           ↓
Hidden:  Linear(39 → 64) → ReLU
           ↓
Hidden:  Linear(64 → 64) → ReLU
           ↓
Output:  Linear(64 → 6)  (Q-values for 6 actions)

Loss function: Mean squared TD error:

where θ⁻ are the target network parameters (explained next).

2.3 Experience Replay

Problem: Consecutive transitions (s₁→s₂→s₃→…) are highly correlated. Training on correlated data causes instability and catastrophic forgetting.

Solution: Store all transitions in a replay buffer of fixed capacity and sample random minibatches:

At each training step, sample a minibatch uniformly at random, compute TD targets, and update via gradient descent.

Why it works:

  • Breaks temporal correlations — random sampling decorrelates training data
  • Data efficiency — each transition gets reused many times
  • Smooth learning — averages over many past behaviors

2.4 Target Network

Problem: In the loss function, both the prediction Qθ(s, a) and the target r + γ max Qθ(s’, a’) depend on the same weights θ. When θ changes, the target moves — chasing a moving goalpost causes oscillation and divergence.

Solution: Maintain a separate target network Q_θ⁻ with frozen weights, updated only periodically:

Every C steps: θ⁻ ← θ (hard copy)

The target network provides stable learning targets for C steps at a time. The main network chases a target that only moves occasionally — not every gradient step.

2.5 ε-Greedy Exploration

Problem: If the agent always picks argmax Q(s, a), it never tries new actions and might miss better options.

Solution: With probability ε, take a random action; otherwise, take the greedy action:

ε starts high (1.0 = fully random) and decays to a small value (0.05 = mostly greedy):

2.6 Complete DQN Algorithm

ALGORITHM: Deep Q-Network (DQN)
─────────────────────────────────────────────────────────
Initialize Q-network Q_θ with random weights θ
Initialize target network Q_θ⁻ ← θ
Initialize replay buffer D with capacity N

For episode = 1, 2, ..., M:
    Reset environment → initial state s₀

    For t = 0, 1, ..., T_max - 1:
        // ACTION SELECTION (ε-greedy)
        With probability ε: a_t ← random action
        Otherwise:          a_t ← argmax_a Q_θ(s_t, a)

        // EXECUTE & STORE
        Execute a_t → observe r_t, s_{t+1}
        Store (s_t, a_t, r_t, s_{t+1}) in D

        // LEARN
        Sample random minibatch from D
        Compute targets: yⱼ = rⱼ + γ · max_{a'} Q_θ⁻(s'ⱼ, a')
        Update θ by minimizing: L = mean(yⱼ - Q_θ(sⱼ, aⱼ))²

        // TARGET & EXPLORATION UPDATES
        Every C steps: θ⁻ ← θ
        ε ← max(ε_min, ε - Δε)

DQN: Strengths and Weaknesses

Strengths:

  • Simple to implement and debug
  • Data-efficient through experience replay
  • Well-understood convergence properties

Weaknesses:

  • Only works with discrete action spaces
  • Replay buffer stores stale transitions — problematic in non-stationary environments
  • ε-greedy exploration is crude (random noise, not directed)
  • Can overestimate Q-values (ameliorated by Double DQN)

3. Proximal Policy Optimization (PPO)

Core idea: Directly optimize the policy π_θ(a|s) using gradient ascent on expected reward, but constrain updates to stay close to the current policy for stability.

Reference: Schulman et al., “Proximal Policy Optimization Algorithms” (2017)

3.1 Policy Gradient Foundation

Instead of learning a value function and deriving the policy, policy gradient methods directly parametrize and optimize the policy.

The policy gradient theorem (Sutton et al., 1999):

If action a_t led to high return G_t, increase its probability. If it led to low return, decrease it.

Problem with vanilla policy gradient: High variance (G_t can be very noisy), and large gradient steps can destroy the policy catastrophically.

3.2 The Trust Region Problem

A large policy update can ruin performance irreversibly:

Old policy:  π_old(action_A) = 0.4,  π_old(action_B) = 0.3, ...
             ↓ large gradient step ↓
New policy:  π_new(action_A) = 0.01, π_new(action_B) = 0.95, ...

This discards everything the policy learned. We need to constrain how much the policy can change per update.

TRPO solves this with a hard KL-divergence constraint, but requires expensive second-order optimization. PPO achieves the same goal with a simple first-order clipping trick.

3.3 PPO Clipped Surrogate Objective

Step 1 — Importance sampling ratio: How much did the policy change for this action?

  • r_t = 1.0: policies agree perfectly
  • r_t = 1.5: new policy is 50% more likely to take this action
  • r_t = 0.7: new policy is 30% less likely

Step 2 — Clipped surrogate:

where ε is the clip range (typically 0.2).

How clipping works:

Advantage Â_tRatio r_tWhat happensEffect
 > 0 (good action)r_t > 1+εProbability increasing too fastClips at 1+ε → limits increase
 > 0 (good action)r_t < 1+εProbability increasing moderatelyNo clip → normal gradient
 < 0 (bad action)r_t < 1−εProbability decreasing too fastClips at 1−ε → limits decrease
 < 0 (bad action)r_t > 1−εProbability decreasing moderatelyNo clip → normal gradient

Worked example: The agent took an action that turned out great (Â = +5.0). Old policy probability was 0.30, new policy pushed it to 0.55.

With ε = 0.2, the clip range is [0.8, 1.2]:

  • Unclipped: 1.833 × 5.0 = 9.17
  • Clipped: 1.2 × 5.0 = 6.0
  • min(9.17, 6.0) = 6.0 — the objective is capped, preventing an overly aggressive update.

3.4 Generalized Advantage Estimation (GAE)

The advantage Â_t measures how much better an action was compared to the average:

We don’t know Q exactly, so GAE provides a smooth bias-variance tradeoff using TD residuals:

GAE combines multiple TD residuals with exponential weighting:

The λ parameter controls the tradeoff:

λ valueEquivalent toBiasVarianceBehavior
λ = 01-step TDHighLowOnly immediate TD error
λ = 1Monte CarloLowHighFull episode return
λ = 0.95Blends ~20 stepsModerateModerateGood practical tradeoff

3.5 Actor-Critic Architecture

PPO uses an actor-critic network with a shared feature extractor and two heads:

Input: state vector s
           ↓
    ┌──────────────────────┐
    │   Shared Feature      │
    │   Extractor           │
    │   (MLP with ReLU)     │
    └───────┬───────┬───────┘
            ↓       ↓
    ┌───────┴──┐ ┌──┴───────┐
    │  ACTOR   │ │  CRITIC  │
    │  π(a|s)  │ │  V(s)    │
    │ Softmax  │ │  Scalar  │
    └──────────┘ └──────────┘

The Actor outputs action probabilities. The Critic outputs a state value estimate used to compute advantages. Sharing the feature extractor is efficient — understanding the environment state is useful for both selecting actions and predicting value.

3.6 Action Masking (Maskable PPO)

In many environments, certain actions are invalid in certain states. Maskable PPO handles this natively by setting masked action logits to −∞ before the softmax:

Since softmax(−∞) = 0, masked actions get exactly zero probability — no post-hoc renormalization needed.

3.7 Total PPO Loss Function

The total loss combines three terms:

Clipped surrogate (policy loss): Improves action selection within the trust region.

Value function loss: Trains the critic to accurately predict state values:

Entropy bonus: Prevents the policy from becoming deterministic too early:

High entropy = maximum exploration. Low entropy = committed exploitation. The coefficient c₂ controls the balance.

3.8 Complete PPO Algorithm

ALGORITHM: Proximal Policy Optimization (PPO)
─────────────────────────────────────────────────────────
Initialize actor-critic network with random weights θ
Set: ε (clip), γ (discount), λ (GAE), c₁, c₂, K (epochs), N (rollout steps)

For iteration = 1, 2, ...:

    // COLLECT ROLLOUTS (on-policy)
    For t = 0, ..., N-1:
        Sample a_t ~ π_θ(·|s_t)
        Execute a_t → observe r_t, s_{t+1}
        Store (s_t, a_t, r_t, V_θ(s_t), log π_θ(a_t|s_t))

    // COMPUTE ADVANTAGES (GAE, backward pass)
    For t = N-1, ..., 0:
        δ_t = r_t + γ·V(s_{t+1}) - V(s_t)
        Â_t = δ_t + (γλ)·Â_{t+1}
    Returns: R_t = Â_t + V(s_t)

    // OPTIMIZE (K epochs over same rollout)
    For epoch = 1, ..., K:
        For each minibatch:
            r_t(θ) = π_θ(a_t|s_t) / π_old(a_t|s_t)
            L_clip = min(r_t·Â_t, clip(r_t, 1-ε, 1+ε)·Â_t)
            L = -L_clip + c₁·(V_θ(s_t) - R_t)² - c₂·H[π_θ]
            θ ← θ - α·∇_θ L

    // DISCARD OLD DATA (on-policy — no replay buffer)
    θ_old ← θ

Critical difference from DQN: PPO discards all collected data after each update. It always trains on fresh rollouts from the current policy. This avoids the stale-data problem but is less data-efficient.


4. DQN vs PPO — When to Use Which

Algorithmic Comparison

AspectDQNPPO
What it learnsValue function Q(s,a)Policy π(a|s) + value V(s)
Policy typeDeterministic + ε-greedyStochastic (probability distribution)
Data usageOff-policy (replay buffer)On-policy (fresh rollouts)
Stability mechanismTarget networkClipped surrogate objective
Explorationε-greedy (random noise)Entropy bonus (natural stochasticity)
Action maskingPost-hoc, ad-hocNative (MaskedSoftmax)
Data efficiencyHigher (reuses transitions)Lower (discards after update)
Non-stationary envsPoor (stale replay data)Good (always fresh data)
Continuous actionsNot supportedSupported (Gaussian policy)

When DQN is Better

  • Stationary environments where old experience stays relevant (e.g., Atari games)
  • Data-scarce settings where you need to squeeze learning from every transition
  • Simple discrete action spaces with clear optimal actions
  • When simplicity matters — DQN is easier to implement and debug

When PPO is Better

  • Non-stationary environments where dynamics shift over time
  • Continuous or large discrete action spaces
  • When you need action masking for constraint satisfaction
  • Multi-task or curriculum learning where reward functions change
  • When stability is critical — PPO’s clipped updates are more predictable

The Non-Stationarity Problem (Why PPO Often Wins)

This is worth emphasizing because it’s the most common failure mode of DQN in practice:

DQN replay buffer problem:
  Step 10:  store (s, a, r=+50, s')  ← collected when dynamics were X
  Step 100: sample for training       ← dynamics are now Y
                                      ← old transition is misleading!

PPO on-policy solution:
  Step 100: collect fresh rollouts with current policy
            → advantages from current value estimates
            → no stale data contamination

In the timetabling example, the population changes every generation. A transition recorded when violations were at 1,400 is useless when they’re down to 70 — the dynamics are completely different. DQN’s replay buffer keeps learning from these stale transitions, while PPO always uses fresh data matching the current state.

The result: PPO achieved 7.8× higher mean reward and showed a clear upward learning trend, while DQN’s reward curve stayed flat after early convergence.


Key Takeaways

  1. DQN learns what actions are worth (Q-values). PPO learns what to do (policy probabilities). Both are valid; the choice depends on your problem.

  2. Experience replay makes DQN data-efficient but vulnerable to non-stationarity. On-policy collection keeps PPO data-fresh but data-hungry.

  3. PPO’s clipped objective is elegant — it achieves trust-region-like stability with a simple min(..., clip(...)) operation instead of costly second-order optimization.

  4. GAE with λ ≈ 0.95 gives you a practical bias-variance tradeoff for advantage estimation, and it’s used in virtually all modern PPO implementations.

  5. For non-stationary environments (changing dynamics, curriculum learning, evolving populations), PPO’s on-policy approach is strongly preferred over DQN’s replay buffer.


References

  1. Mnih, V. et al. (2015). “Human-level control through deep reinforcement learning.” Nature, 518(7540), 529–533.
  2. Schulman, J. et al. (2017). “Proximal Policy Optimization Algorithms.” arXiv:1707.06347.
  3. Schulman, J. et al. (2015). “High-Dimensional Continuous Control Using Generalized Advantage Estimation.” arXiv:1506.02438.
  4. Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
  5. Huang, S. et al. (2020). “A Closer Look at Invalid Action Masking in Policy Gradient Algorithms.” arXiv:2006.14171.
  6. Raffin, A. et al. (2021). “Stable-Baselines3: Reliable Reinforcement Learning Implementations.” JMLR, 22(268), 1–8.