Project

ACPER: Actor-Critic with Prioritised Experience Replay

reinforcement-learningdeep-learningpythonpytorchpolicy-gradient

Deep reinforcement learning implementation combining actor-critic policy gradient with prioritised experience replay for more sample-efficient training on continuous control tasks.

ACPER combines two orthogonal improvements to standard policy gradient methods: the variance-reduction of actor-critic architecture and the sample efficiency of Prioritised Experience Replay (PER). The result is a more stable training signal on continuous control environments where both data efficiency and gradient quality matter.

The Problem It Solves

Standard actor-critic (A2C, A3C) uses on-policy data - transitions collected by the current policy are used once and discarded. This is sample-wasteful. Off-policy methods like DQN reuse experience, but the standard uniform replay buffer treats all transitions equally, even when some provide much stronger learning signal.

PER weights transitions by their temporal-difference error: experiences where the current value estimate is most wrong get sampled more often. ACPER applies this prioritisation to the actor-critic framework.

Implementation

The critic learns a value function V(s) using prioritised replay; the actor uses the TD-error-weighted advantage estimates to update the policy:

# Sample batch with priority weights
batch, weights, idxs = replay_buffer.sample(batch_size)

# Compute TD error for critic update
td_error = r + γ * V(s') - V(s)

# Update priorities in replay buffer
replay_buffer.update_priorities(idxs, td_error.abs().detach())

# Weighted actor loss
actor_loss = -(weights * log_prob * advantage).mean()

The IS (importance sampling) weights correct for the non-uniform sampling bias, ensuring the policy gradient estimate stays unbiased.

Notes

This was built as an experimental implementation to understand the interaction between on-policy gradient estimation and off-policy replay. The stability improvement over vanilla A2C on CartPole and LunarLander was measurable, though the benefit diminishes in environments where on-policy data collection is cheap.

I write about this kind of work - reliability, uncertainty, building things that work in production. One email per month.