ACPER: Actor-Critic with Prioritised Experience Replay
Deep reinforcement learning implementation combining actor-critic policy gradient with prioritised experience replay for more sample-efficient training on continuous control tasks.
ACPER combines two orthogonal improvements to standard policy gradient methods: the variance-reduction of actor-critic architecture and the sample efficiency of Prioritised Experience Replay (PER). The result is a more stable training signal on continuous control environments where both data efficiency and gradient quality matter.
The Problem It Solves
Standard actor-critic (A2C, A3C) uses on-policy data — transitions collected by the current policy are used once and discarded. This is sample-wasteful. Off-policy methods like DQN reuse experience, but the standard uniform replay buffer treats all transitions equally, even when some provide much stronger learning signal.
PER weights transitions by their temporal-difference error: experiences where the current value estimate is most wrong get sampled more often. ACPER applies this prioritisation to the actor-critic framework.
Implementation
The critic learns a value function V(s) using prioritised replay; the actor uses the TD-error-weighted advantage estimates to update the policy:
# Sample batch with priority weights
batch, weights, idxs = replay_buffer.sample(batch_size)
# Compute TD error for critic update
td_error = r + γ * V(s') - V(s)
# Update priorities in replay buffer
replay_buffer.update_priorities(idxs, td_error.abs().detach())
# Weighted actor loss
actor_loss = -(weights * log_prob * advantage).mean()The IS (importance sampling) weights correct for the non-uniform sampling bias, ensuring the policy gradient estimate stays unbiased.
Notes
This was built as an experimental implementation to understand the interaction between on-policy gradient estimation and off-policy replay. The stability improvement over vanilla A2C on CartPole and LunarLander was measurable, though the benefit diminishes in environments where on-policy data collection is cheap.
I write about this kind of work — reliability, uncertainty, building things that work in production. One email per month.