Publication

Beyond Static Datasets: Robust Offline Policy Optimization via Vetted Synthetic Transitions

arXiv preprint, 2026

P. Agand, M. Chen (2026). “Beyond Static Datasets: Robust Offline Policy Optimization via Vetted Synthetic Transitions.” arXiv preprint.


Offline RL · Reinforcement Learning · Policy Optimization · Distributional Robustness

Offline reinforcement learning framework that addresses distributional shift between training data and deployment environments, enabling robust policy learning…

This work addresses a core limitation of offline reinforcement learning: policies trained on static datasets tend to fail when the deployment environment differs, even slightly, from the distribution that generated the data. We develop a framework for robust offline policy optimization that explicitly accounts for this distributional shift.

The Static Dataset Problem

Offline RL learns control policies from historical data without environment interaction, which is valuable when online exploration is expensive or dangerous (healthcare, robotics, finance). The challenge is that the training dataset was collected by a particular behavior policy under particular environmental conditions, and the deployment environment may differ in ways the data never captured.

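Standard offline RL methods (CQL, IQL, TD3+BC) keep the learned policy close to the behavior policy, either by penalizing out-of-distribution actions or by avoiding them when computing value targets. This guards against errors on actions the dataset never covered, and it works when the deployment environment matches training. It does not model changes in the environment itself, so under distributional shift the resulting policies are overly conservative without being robust.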

Approach

We frame robust offline policy optimization as a minimax problem: learn a policy that performs well against a set of possible test environments, not just the training environment.
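One way to write this objective down, using notation assumed here for illustration rather than taken from the paper: $\hat{P}$ is the transition model estimated from the dataset, $D$ is a divergence between next-state distributions, and $\epsilon$ is the radius of the set of candidate environments.

```latex
\[
\pi^{\star} \in \arg\max_{\pi} \; \min_{P \in \mathcal{U}_{\epsilon}} \;
\mathbb{E}_{\pi, P}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r(s_t, a_t) \right],
\qquad
\mathcal{U}_{\epsilon} = \left\{ P \;:\; D\!\left( P(\cdot \mid s, a) \,\|\, \hat{P}(\cdot \mid s, a) \right) \le \epsilon \;\; \forall (s, a) \right\}
\]
```

With $\epsilon = 0$ the set collapses to $\hat{P}$ alone and the objective reduces to ordinary policy optimization on the estimated training environment.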

Uncertainty set construction: We use the offline dataset to estimate a perturbation set around the training distribution, that is, the set of environments "close to" what was observed, where closeness is measured by a distributional divergence.
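As a rough illustration of what such a set could look like in the simplest tabular case, the sketch below centers the set on empirical transition frequencies and gives each state-action pair an L1 radius that shrinks with its visit count. The function name, the L1 choice, and the `c / sqrt(count)` radius rule are all assumptions made for this example, not the construction used in the paper.

```python
import numpy as np

def estimate_uncertainty_set(transitions, n_states, n_actions, c=1.0):
    """Return (P_hat, radius) from a tabular dataset of (s, a, s_next) triples.

    P_hat[s, a] is the empirical next-state distribution (the set's center).
    radius[s, a] is a hypothetical L1 radius, c / sqrt(count), capped at 2,
    the largest possible L1 distance between two distributions.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1

    n_sa = counts.sum(axis=2)                      # visits per (s, a)
    uniform = np.full(n_states, 1.0 / n_states)    # fallback for unseen pairs
    P_hat = np.where(n_sa[..., None] > 0,
                     counts / np.maximum(n_sa, 1)[..., None],
                     uniform)

    radius = np.minimum(2.0, c / np.sqrt(np.maximum(n_sa, 1)))
    radius[n_sa == 0] = 2.0                        # no data: maximal uncertainty
    return P_hat, radius
```

Pairs that the behavior policy visited often get a tight set, while pairs it never visited get the widest possible one, which is the tabular analogue of being uncertain where the data is thin.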

Robust Bellman operator: We replace the standard Bellman update with a robust version that considers the worst-case environment within the uncertainty set, encouraging policies that maintain performance under perturbation.
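A minimal tabular sketch of a robust backup, assuming the uncertainty set is an L1 ball of radius `eps` around each nominal next-state distribution. The L1 ball and all function names are illustrative assumptions, since the text above only specifies "distributional divergence". For this particular set the inner worst case has a simple greedy solution: shift up to `eps / 2` probability mass from the highest-value next states onto the lowest-value one.

```python
import numpy as np

def worst_case_expectation(p, v, eps):
    """Minimize q @ v over distributions q with ||q - p||_1 <= eps."""
    q = np.array(p, dtype=float)
    lo = int(np.argmin(v))
    budget = min(eps / 2.0, 1.0 - q[lo])   # mass that can be moved onto `lo`
    q[lo] += budget
    for s in np.argsort(v)[::-1]:          # take mass from best states first
        if budget <= 0:
            break
        if s == lo:
            continue
        take = min(q[s], budget)
        q[s] -= take
        budget -= take
    return float(q @ v)

def robust_bellman_backup(P, R, V, gamma, eps):
    """Q[s, a] = R[s, a] + gamma * (worst-case expected V over the L1 ball)."""
    n_states, n_actions, _ = P.shape
    Q = np.empty((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            Q[s, a] = R[s, a] + gamma * worst_case_expectation(P[s, a], V, eps)
    return Q

def robust_value_iteration(P, R, gamma=0.9, eps=0.1, iters=200):
    """Iterate the robust backup, acting greedily with respect to robust Q."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = robust_bellman_backup(P, R, V, gamma, eps).max(axis=1)
    return V
```

With `eps=0` this is ordinary value iteration on the nominal model. Increasing `eps` can only lower the resulting values, which is the pessimism that pushes the policy toward actions whose outcomes hold up under perturbation.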

Conservative data augmentation: We augment the offline dataset with synthetic transitions from perturbed environments, improving coverage of the distributional shift scenarios the robust objective is designed to handle.
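To make the idea concrete, here is a small tabular sketch of drawing synthetic transitions from a perturbed model that stays inside an L1 ball of radius `eps` around the nominal one. The mixing scheme, the uniform choice of state-action pairs, and every name below are assumptions for illustration and should not be read as the paper's augmentation procedure.

```python
import numpy as np

def sample_perturbed_transitions(P_hat, R, eps, n_samples, rng):
    """Sample (s, a, r, s_next) tuples from a randomly perturbed model.

    Each nominal row p is mixed with a random distribution u:
        q = (1 - eps/2) * p + (eps/2) * u
    so ||q - p||_1 = (eps/2) * ||u - p||_1 <= eps for eps in [0, 2].
    """
    n_states, n_actions, _ = P_hat.shape
    U = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    Q = (1.0 - eps / 2.0) * P_hat + (eps / 2.0) * U

    synthetic = []
    for _ in range(n_samples):
        s = int(rng.integers(n_states))    # uniform over pairs, for simplicity
        a = int(rng.integers(n_actions))
        s_next = int(rng.choice(n_states, p=Q[s, a]))
        synthetic.append((s, a, float(R[s, a]), s_next))
    return Q, synthetic
```

Keeping `eps` small is what makes the augmentation conservative in this sketch: the synthetic transitions come from environments that are provably within the chosen distance of the one the data was collected in.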

Paper

Available as arXiv preprint 2601.18107. This work extends research begun during my internship at Borealis AI. It connects to a broader theme that runs through my robotics and LLM reliability work alike: building AI systems that are honest about the limits of their training data.
