Model-Based Confidence-Aware Offline RL

reinforcement-learning, offline-rl, python, pytorch, control-systems, robotics

Offline reinforcement learning framework combining DDPG-based policy optimization with model-based confidence estimation, applied to combustion control and marine vessel navigation.

Deploying RL in real systems (power plants, marine vessels, industrial controllers) means you can't run thousands of exploratory episodes on the physical plant. This project implements a model-based offline RL framework that learns a world model from logged data, then optimizes the policy in simulation while tracking confidence bounds to avoid out-of-distribution extrapolation.

Architecture

The framework has three main components:

World Model (Simulator): A recurrent neural network (Simulator/simrnn_model.py) trained on historical operational data. For combustion control it predicts thermal output and emissions; for vessel navigation it predicts vessel state transitions.

Offline RL Policy (MORE): Implemented in RL/primal_dual_ddpg.py, a constrained DDPG variant that adds a primal-dual optimization layer to keep policy actions within the region where the world model is confident. When the model's uncertainty estimate exceeds a threshold, the policy is penalized for actions that would move the system into the low-confidence region.

Co-teaching: Two models are trained simultaneously on different subsets of the data. Each model labels the other's uncertain samples, filtering out noisy transitions before they corrupt policy gradients.
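
The co-teaching step can be illustrated with a minimal sketch. It assumes the common small-loss variant of co-teaching, where each model keeps the samples it fits best and hands them to its peer; the repository may select samples differently. The names small_loss_indices, co_teaching_select and keep_ratio are hypothetical and used here only for illustration, and the scalar (input, target) pairs stand in for real transitions.

```python
from typing import Callable, List, Sequence, Tuple

# Stand-in for a logged transition: (model input, observed target).
Sample = Tuple[float, float]


def small_loss_indices(losses: Sequence[float], keep_ratio: float) -> List[int]:
    """Indices of the keep_ratio fraction of samples with the smallest loss."""
    n_keep = max(1, int(round(keep_ratio * len(losses))))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:n_keep])


def co_teaching_select(
    batch: Sequence[Sample],
    predict_a: Callable[[float], float],
    predict_b: Callable[[float], float],
    keep_ratio: float = 0.8,
) -> Tuple[List[Sample], List[Sample]]:
    """Return (samples for model A, samples for model B).

    Assumption: squared error is the per-sample loss. Each model ranks the
    batch by its own loss, and the samples it trusts are given to its peer.
    """
    loss_a = [(predict_a(x) - y) ** 2 for x, y in batch]
    loss_b = [(predict_b(x) - y) ** 2 for x, y in batch]
    idx_for_b = small_loss_indices(loss_a, keep_ratio)  # A teaches B
    idx_for_a = small_loss_indices(loss_b, keep_ratio)  # B teaches A
    return [batch[i] for i in idx_for_a], [batch[i] for i in idx_for_b]


if __name__ == "__main__":
    # Three clean samples of y = 2x and one corrupted transition.
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 100.0)]
    for_a, for_b = co_teaching_select(
        data, lambda x: 2.0 * x, lambda x: 2.1 * x, keep_ratio=0.75
    )
    print(for_a)
    print(for_b)
```

The cross-exchange is the point of the design: a model that has started to memorize a noisy transition will not be corrected by its own loss ranking, but its peer, trained on a different subset, is less likely to share the same mistake.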

Setup

git clone https://github.com/pagand/ORL_optimizer
cd ORL_optimizer
conda create -n orl python=3.10 && conda activate orl
pip install -r requirements.txt
pip install 'cython<3' scipy==1.12
# PyTorch (CUDA 11.8)
pip3 install torch --index-url https://download.pytorch.org/whl/cu118

Running

# Pick one subproject directory (paths relative to the repository root)
cd MBORL        # Model-based offline RL
cd CORL         # CORL baseline implementations
cd VesselModel  # Vessel navigation training + simulator
cd MORE         # MORE paper implementation

Applications

The framework was tested on two real-world domains:

Combustion optimization - reducing CO2 and NOx emissions in thermal power plants while maintaining output targets (based on the DeepThermal dataset)
Marine vessel navigation - learning efficient vessel control policies from AIS tracking data without interacting with the physical vessel

The confidence-awareness is what makes this usable in practice: a policy that knows when it's extrapolating is safer to deploy than one that confidently acts outside its training distribution.
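
To make the confidence-awareness concrete, here is a minimal sketch of a primal-dual update of the kind a constrained DDPG variant might use. It assumes the constraint is "model uncertainty stays below a threshold" and handles it with a standard Lagrange multiplier; it is not taken from RL/primal_dual_ddpg.py, and the names lagrangian, dual_update, pick_action, lam and threshold are hypothetical.

```python
from typing import List, Tuple


def lagrangian(q_value: float, uncertainty: float, lam: float, threshold: float) -> float:
    """Actor objective: critic value minus the multiplier-weighted constraint term."""
    return q_value - lam * (uncertainty - threshold)


def dual_update(lam: float, mean_uncertainty: float, threshold: float, lr: float) -> float:
    """Projected gradient ascent on the multiplier, clipped so it stays non-negative."""
    return max(0.0, lam + lr * (mean_uncertainty - threshold))


def pick_action(candidates: List[Tuple[float, float]], lam: float, threshold: float) -> int:
    """Index of the (q_value, uncertainty) candidate with the highest Lagrangian."""
    scores = [lagrangian(q, u, lam, threshold) for q, u in candidates]
    return max(range(len(scores)), key=lambda i: scores[i])


if __name__ == "__main__":
    # Candidate 0 has the higher value estimate but lies far outside the data;
    # candidate 1 is slightly worse but well inside the confident region.
    candidates = [(1.0, 0.9), (0.8, 0.1)]
    threshold = 0.2
    lam = 0.0
    for step in range(5):
        choice = pick_action(candidates, lam, threshold)
        # The multiplier grows while the chosen action violates the constraint.
        lam = dual_update(lam, candidates[choice][1], threshold, lr=0.5)
        print(step, choice, round(lam, 3))
```

With the multiplier at zero the policy behaves like unconstrained DDPG and picks the high-value, high-uncertainty action. As violations accumulate, the multiplier rises until the in-distribution action scores higher, which is the behavior the penalty is meant to produce.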
