DMFuser: Distilled Multi-Task Learning for Transformer-Based Multi-modal Fusion
P. Agand, et al. (2024). “DMFuser: Distilled Multi-Task Learning for Transformer-Based Multi-modal Fusion.” IROS.
Multi-task learning framework for autonomous driving that fuses camera and LiDAR data via a transformer-based architecture, with knowledge distillation to reduce de
DMFuser addresses a fundamental challenge in end-to-end autonomous driving: how to effectively fuse heterogeneous sensor inputs (camera images and LiDAR point clouds) while simultaneously optimizing for multiple driving tasks without catastrophic forgetting.
Problem
Standard multi-task learning for autonomous driving faces a tension: tasks like waypoint prediction, semantic segmentation, and object detection benefit from shared representations, but naive joint training often leads to one task dominating gradient updates at the expense of others. This gets worse with heterogeneous sensor modalities, where the optimal representation for RGB features differs from the optimal representation for 3D point cloud features.
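To make the gradient-dominance problem concrete, here is a minimal sketch in plain Python. The gradient values are made up for illustration and do not come from the paper. They stand in for the gradients of two task losses with respect to the same shared weights. When the two gradients point in opposing directions and one is much larger, the summed update follows the larger one and, to first order, increases the other task's loss.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical gradients of two task losses w.r.t. the same shared weights.
grad_waypoint = [0.9, -0.2, 0.4]
grad_segmentation = [-8.0, 1.0, -3.0]

# Naive joint training sums the per-task gradients before the update.
joint = [g1 + g2 for g1, g2 in zip(grad_waypoint, grad_segmentation)]

# Negative cosine: the two tasks pull the shared weights in opposing directions.
conflict = cosine(grad_waypoint, grad_segmentation)

# Negative cosine with the joint gradient: a descent step along `joint`
# increases the waypoint loss to first order, so segmentation dominates.
alignment = cosine(joint, grad_waypoint)

print(f"task-vs-task cosine:  {conflict:.3f}")
print(f"joint-vs-waypoint:    {alignment:.3f}")
```

A negative cosine between per-task gradients is one common diagnostic for task interference; whether DMFuser's authors measured it this way is not stated here.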
Approach
DMFuser introduces a distilled multi-task transformer architecture with three key components:
Cross-modal attention: Transformer attention blocks that learn to align camera and LiDAR features at multiple spatial scales, producing a fused representation that preserves modality-specific information while enabling cross-modal reasoning.
Task-specific heads with shared backbone: A single transformer backbone processes the fused features, with lightweight task-specific decoders for each output. This limits parameter count while maintaining task performance.
Knowledge distillation from task experts: We train individual expert models for each task, then distill their knowledge into the multi-task student. This avoids the gradient conflict problem in joint training while retaining the inference efficiency of a single forward pass.
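The first and third components can be sketched in a few lines of dependency-free Python. This is an illustrative toy, not the DMFuser implementation. It assumes single-head attention without learned projections, a residual connection to keep the query modality's own features, and a mean-squared-error match between student and expert outputs. The function names, task names, token values, and task weights are all hypothetical, and the paper's actual architecture and loss may differ.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention across modalities.

    Each query token (e.g. a camera feature) takes a weighted average of the
    value tokens (e.g. LiDAR features). Learned projections are omitted.
    """
    scale = math.sqrt(len(keys[0]))
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        attended = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        # Residual connection: keep the query modality's own information.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def distillation_loss(student_outputs, expert_outputs, weights):
    """Weighted sum over tasks of the student-vs-expert output mismatch."""
    return sum(
        weights[task] * mse(student_outputs[task], expert_outputs[task])
        for task in expert_outputs
    )

# Toy tokens: 2 camera tokens attend over 3 LiDAR tokens (feature dim 2).
camera_tokens = [[1.0, 0.0], [0.0, 1.0]]
lidar_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
fused = cross_attention(camera_tokens, lidar_tokens, lidar_tokens)

# Toy per-task outputs from the multi-task student and the frozen experts.
student = {"waypoints": [0.1, 0.9], "segmentation": [0.2, 0.7, 0.1]}
experts = {"waypoints": [0.0, 1.0], "segmentation": [0.2, 0.7, 0.1]}
task_weights = {"waypoints": 1.0, "segmentation": 0.5}
loss = distillation_loss(student, experts, task_weights)
```

Because each expert supplies a fixed target for its own head, the student's objective is a sum of per-task regression terms rather than a mix of competing raw task losses, which is the intuition behind using distillation to sidestep gradient conflict.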
Results
On the CARLA autonomous driving benchmark, DMFuser achieves competitive driving scores while reducing model parameters by ~40% compared to task-specific baselines run in parallel. The distillation approach largely eliminates task interference observed in naive multi-task training.
Code
The full implementation is available at pagand/e2etransfuser, including training scripts, pre-trained weights, and CARLA evaluation code.