Categorical DQN (C51)


C51 introduces a distributional perspective for DQN: instead of learning a single expected value for each action, C51 learns to predict a distribution over the possible returns of that action. Empirically, C51 demonstrates strong performance in the Arcade Learning Environment (ALE).
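To make the idea concrete, here is a minimal sketch in plain Python (not the code from this repository). It assumes the return distribution of each action is stored as a probability mass function over a fixed, evenly spaced set of return values ("atoms"), and that the agent acts greedily on the mean of that distribution. The 51 atoms on [-10, 10] follow the Atari setting described by Bellemare et al. (2017); the helper names and the two example distributions are made up for illustration.

```python
def make_atoms(v_min, v_max, n_atoms):
    """Evenly spaced return values that form the support of the distribution."""
    delta_z = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * delta_z for i in range(n_atoms)]


def expected_value(pmf, atoms):
    """Collapse a return distribution to a scalar Q-value."""
    return sum(p * z for p, z in zip(pmf, atoms))


def greedy_action(pmfs, atoms):
    """Pick the action whose return distribution has the largest mean."""
    q_values = [expected_value(pmf, atoms) for pmf in pmfs]
    return max(range(len(q_values)), key=lambda a: q_values[a])


atoms = make_atoms(-10.0, 10.0, 51)

# Hypothetical distributions for two actions, each with all mass on one atom.
pmf_left = [0.0] * 51
pmf_left[25] = 1.0   # atom 25 corresponds to a return of about 0
pmf_right = [0.0] * 51
pmf_right[30] = 1.0  # atom 30 corresponds to a return of about 2

best = greedy_action([pmf_left, pmf_right], atoms)
```

A standard DQN would only keep the two scalar means; the distributional view keeps the whole probability vector per action and reduces it to a mean only when an action has to be chosen.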

Original papers:

Implemented Variants

Variants Implemented | Description
Atari variant, docs | For playing Atari games. It uses convolutional layers and common Atari-based pre-processing techniques.
Classic control variant, docs | For classic control tasks like CartPole-v1.

Below are our single-file implementations of C51:

The Atari variant has the following features:

  • For playing Atari games. It uses convolutional layers and common Atari-based pre-processing techniques.
  • Works with Atari's pixel Box observation space of shape (210, 160, 3)
  • Works with the Discrete action space


poetry install -E atari
python cleanrl/ --env-id BreakoutNoFrameskip-v4
python cleanrl/ --env-id PongNoFrameskip-v4

Explanation of the logged metrics

Running python cleanrl/ will automatically record various metrics, such as the loss and the estimated Q-values, in TensorBoard. Below is the documentation for these metrics:

  • charts/episodic_return: episodic return of the game
  • charts/SPS: number of steps per second
  • losses/loss: the cross-entropy loss between the predicted state-action value distribution at step \(t\) and the projected target distribution computed from step \(t+1\)
  • losses/q_values: implemented as (old_pmfs * q_network.atoms).sum(1), that is, the probability of obtaining each return \(x\) (old_pmfs) multiplied by \(x\) (q_network.atoms) and summed over the atoms, then averaged over the batch sampled from the replay buffer; useful for gauging whether under- or overestimation occurs
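The two loss metrics above can be illustrated for a single transition with the sketch below. It is written in plain Python for readability and is not the batched tensor code used in the implementation. It follows the categorical projection described by Bellemare et al. (2017): each atom is shifted by the Bellman update, clipped to the support, and its probability mass is split between the two nearest atoms. The function names, the 5-atom toy support, and the clamping constant eps are assumptions made for this example.

```python
import math


def project_distribution(next_pmf, atoms, reward, done, gamma):
    """Project the shifted target distribution back onto the fixed atoms."""
    n_atoms = len(atoms)
    v_min, v_max = atoms[0], atoms[-1]
    delta_z = (v_max - v_min) / (n_atoms - 1)
    target_pmf = [0.0] * n_atoms
    for p, z in zip(next_pmf, atoms):
        # Bellman update of one atom, clipped to the support.
        tz = min(max(reward + gamma * (1.0 - done) * z, v_min), v_max)
        # Fractional index of tz on the support, kept inside valid bounds.
        b = min(max((tz - v_min) / delta_z, 0.0), n_atoms - 1.0)
        lower, upper = math.floor(b), math.ceil(b)
        if lower == upper:
            # tz lands exactly on an atom, which receives all of the mass.
            target_pmf[lower] += p
        else:
            # Split the mass between the two neighbouring atoms.
            target_pmf[lower] += p * (upper - b)
            target_pmf[upper] += p * (b - lower)
    return target_pmf


def cross_entropy(target_pmf, predicted_pmf, eps=1e-5):
    """Cross-entropy loss; predictions are clamped to avoid log(0)."""
    return -sum(
        t * math.log(min(max(q, eps), 1.0 - eps))
        for t, q in zip(target_pmf, predicted_pmf)
    )


def q_value(pmf, atoms):
    """Expected return, mirroring (old_pmfs * q_network.atoms).sum(1)."""
    return sum(p * z for p, z in zip(pmf, atoms))


# Toy support with 5 atoms: -2, -1, 0, 1, 2.
atoms = [-2.0, -1.0, 0.0, 1.0, 2.0]
next_pmf = [0.0, 0.0, 1.0, 0.0, 0.0]  # next state: return 0 with certainty
old_pmf = [0.2, 0.2, 0.2, 0.2, 0.2]   # current prediction: uniform
target_pmf = project_distribution(next_pmf, atoms, reward=0.5, done=0.0, gamma=0.99)
loss = cross_entropy(target_pmf, old_pmf)
```

In this toy case the shifted return 0.5 falls halfway between the atoms 0 and 1, so the target places half of the mass on each. Comparing q_value of the prediction against the observed episodic returns is what makes losses/q_values useful for spotting over- or underestimation.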

Implementation details

The Atari variant is based on (Bellemare et al., 2017) but differs in a few ways:

  1. (Bellemare et al., 2017) injects stochasticity: "on each frame the environment rejects the agent’s selected action with probability \(p = 0.25\)". This implementation does not do this.
  2. This implementation uses a self-contained evaluation scheme: it reports the episodic returns obtained throughout training, whereas (Bellemare et al., 2017) is trained with --end-e=0.01 but reports episodic returns from a separate evaluation process with --end-e=0.001 (see "5.2. State-of-the-Art Results" on page 7).
  3. This implementation rescales the gradient so that the norm of the gradients does not exceed 0.5, as is done in PPO ( ppo2/
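The gradient rescaling in the third point can be sketched as follows. This is a plain-Python illustration of clipping by the joint L2 norm, not the implementation's code; the function name clip_by_global_norm and the example gradients are hypothetical, and the small constant in the denominator is an assumption that mirrors the usual guard against division by zero.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradient vectors so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Only ever scale down; gradients that are already small are left alone.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [[g * scale for g in grad] for grad in grads], total_norm


# Hypothetical gradients for two parameter tensors, flattened to lists.
grads = [[3.0, 0.0], [0.0, 4.0]]  # joint norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.5)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
```

All gradients are multiplied by the same factor, so the update direction is preserved and only its length is capped. In PyTorch, torch.nn.utils.clip_grad_norm_ provides this kind of rescaling and is typically called between loss.backward() and optimizer.step().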

Experiment results

PR vwxyzjn/cleanrl#159 tracks our effort to conduct experiments, and the reproduction instructions can be found at vwxyzjn/cleanrl/benchmark/c51.

Below are the average episodic returns for the Atari variant:

Environment | This implementation, 10M steps | (Bellemare et al., 2017, Figure 14), 50M steps | (Hessel et al., 2017, Figure 5)
BreakoutNoFrameskip-v4 | 467.00 ± 96.11 | 748 | ~500 at 10M steps, ~600 at 50M steps
PongNoFrameskip-v4 | 19.32 ± 0.92 | 20.9 | ~20 at 10M steps, ~20 at 50M steps
BeamRiderNoFrameskip-v4 | 9986.96 ± 1953.30 | 14,074 | ~12000 at 10M steps, ~14000 at 50M steps

Note that we save computational time by reducing timesteps from 50M to 10M, but our implementation scores the same as or higher than (Mnih et al., 2015) at 10M steps.

Learning curves:

Tracked experiments and game play videos:

The classic control variant has the following features:

  • Works with the Box observation space of low-level features
  • Works with the Discrete action space
  • Works with envs like CartPole-v1


python cleanrl/ --env-id CartPole-v1

Explanation of the logged metrics

See the explanation of the logged metrics for the Atari variant above.

Implementation details

The classic control variant shares the same implementation details as the Atari variant, except that it runs with different hyperparameters and a different neural network architecture. Specifically,

  1. It uses a simpler neural network as follows:

        network = nn.Sequential(
            nn.Linear(np.array(env.single_observation_space.shape).prod(), 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, env.single_action_space.n),
        )
  2. It runs with different hyperparameters:

    python --total-timesteps 500000 \
        --learning-rate 2.5e-4 \
        --buffer-size 10000 \
        --gamma 0.99 \
        --target-network-frequency 500 \
        --max-grad-norm 0.5 \
        --batch-size 128 \
        --start-e 1 \
        --end-e 0.05 \
        --exploration-fraction 0.5 \
        --learning-starts 10000 \
        --train-frequency 10
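To show how the exploration flags above interact, here is a small sketch of a linear epsilon schedule. It assumes, as is common for DQN-style agents, that epsilon is annealed linearly from --start-e to --end-e over the first --exploration-fraction of --total-timesteps and then held constant; the function name and variable names are chosen for this example.

```python
def linear_schedule(start_e, end_e, duration, t):
    """Linearly anneal epsilon from start_e to end_e over `duration` steps."""
    slope = (end_e - start_e) / duration
    return max(slope * t + start_e, end_e)


total_timesteps = 500000
exploration_fraction = 0.5
duration = int(exploration_fraction * total_timesteps)  # 250000 steps

eps_start = linear_schedule(1.0, 0.05, duration, 0)       # fully random at first
eps_mid = linear_schedule(1.0, 0.05, duration, 125000)    # halfway through annealing
eps_late = linear_schedule(1.0, 0.05, duration, 400000)   # held at the final value
```

With these settings the agent explores heavily during the first 250,000 steps and then acts greedily 95% of the time for the rest of training.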

Experiment results

PR vwxyzjn/cleanrl#159 tracks our effort to conduct experiments, and the reproduction instructions can be found at vwxyzjn/cleanrl/benchmark/c51.

Below are the average episodic returns for the classic control variant:

Environment | Average episodic return
CartPole-v1 | 498.51 ± 1.77
Acrobot-v1 | -88.81 ± 8.86
MountainCar-v0 | -167.71 ± 26.85

Note that C51 has no official benchmark on classic control environments, so we did not include a comparison. That said, our implementation was able to achieve near-perfect scores in CartPole-v1 and Acrobot-v1; further, it can obtain successful runs in the sparse-reward environment MountainCar-v0.

Learning curves:

Tracked experiments and game play videos:

