RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

Nasser, Zaid; Iumanov, Mikhail; Li, Tianhao; Popov, Maxim; Mahmoud, Jaafar; Kolyubin, Sergey

Abstract

We present RADIO-ViPE (Reduce All Domains Into One - Video Pose Engine), an online semantic SLAM system for geometry-aware open-vocabulary grounding: it associates arbitrary natural-language queries with localized 3D regions and objects in dynamic environments. Unlike existing approaches that require calibrated, posed RGB-D input, RADIO-ViPE operates directly on raw monocular RGB video streams and requires no prior camera intrinsics, depth sensors, or pose initialization. The system tightly couples multi-modal embeddings spanning vision and language, derived from agglomerative foundation models (e.g., RADIO), with geometric scene information. This vision-language-geometric fusion is optimized with adaptive robust kernels designed to handle both actively moving objects and agent-displaced scene elements (e.g., furniture rearranged during an egocentric session). Experiments show that RADIO-ViPE achieves state-of-the-art results on the dynamic TUM-RGBD benchmark while remaining competitive with offline open-vocabulary methods that rely on calibrated data and static-scene assumptions. RADIO-ViPE bridges a critical gap in real-world deployment, enabling robust open-vocabulary semantic grounding for autonomous robotics, AR/VR applications, and unconstrained in-the-wild video streams.

Contributions

What RADIO-ViPE Introduces

Vision-Language-Geometric Fusion

A novel tightly coupled multi-modal fusion that jointly embeds high-level features from agglomerative foundation models with geometric constraints directly within a dense bundle adjustment framework.

Temporally Consistent Adaptive Robust Kernel

A dynamic-aware robust optimization scheme extending adaptive kernel formulations with temporal consistency, reasoning jointly over geometric reprojection error and cross-view semantic embedding discrepancy.

Calibration-Free Online Open-Vocab SLAM

An online, ready-to-deploy semantic SLAM system that unifies vision, language, and geometry from uncalibrated monocular RGB — no depth sensors, pose priors, or category supervision required.

Pipeline

System Overview

Fig. 2 — RADIO-ViPE pipeline: GeoCalib → RADSeg → DroidSLAM flow → Dense Tightly Coupled Bundle Adjustment → Open-Vocabulary Grounding

STEP 01

Camera Init

Intrinsics bootstrapped from sampled frames via GeoCalib — no calibration targets required.

STEP 02

Keyframe Selection

Relative motion is estimated via weighted dense optical flow; frames whose motion exceeds a threshold become keyframes.
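The keyframe test in this step can be illustrated with a minimal sketch. It assumes the dense flow field and a per-pixel confidence map are already available as NumPy arrays. The function names and the threshold value are hypothetical and are not taken from the RADIO-ViPE implementation.

```python
import numpy as np

def weighted_flow_magnitude(flow, weights):
    """Confidence-weighted mean flow magnitude, in pixels.

    flow:    (H, W, 2) per-pixel displacement between two frames.
    weights: (H, W) non-negative per-pixel confidences.
    """
    magnitude = np.linalg.norm(flow, axis=-1)
    total = weights.sum()
    if total <= 0:
        # No trusted pixels: report zero motion rather than dividing by zero.
        return 0.0
    return float((weights * magnitude).sum() / total)

def is_keyframe(flow, weights, threshold_px=2.5):
    # threshold_px is an illustrative value, not the paper's setting.
    return weighted_flow_magnitude(flow, weights) > threshold_px
```

Weighting by confidence keeps unreliable flow (for example on textureless regions) from triggering or suppressing a keyframe on its own.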

STEP 03

Feature Extraction

Dense multi-modal embeddings from RADSeg; compressed to K=256 dims via PCA on encoder space.
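As a rough illustration of the PCA compression, the sketch below fits a basis with a plain SVD and projects features onto the leading K directions. It is a generic PCA under stated assumptions, not the system's actual code: the helper names are hypothetical, and how and when the real basis is fitted is not specified here.

```python
import numpy as np

def fit_pca(features, k):
    """Fit a PCA basis on features of shape (N, D).

    Returns the feature mean (D,) and the top-k principal directions (k, D).
    """
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def compress(features, mean, components):
    """Project (N, D) features to (N, k) codes."""
    return (features - mean) @ components.T

def decompress(codes, mean, components):
    """Map (N, k) codes back to an approximation in the original D-dim space."""
    return codes @ components + mean
```

With encoder features of dimension D >= 256, `fit_pca(features, 256)` followed by `compress` yields the K=256 codes; `decompress` gives the approximate inverse that a later decoding step would need.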

STEP 04

Depth Estimation

Metric depth per keyframe from foundation models; converted to inverse depth for numerical stability.
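The depth-to-inverse-depth conversion can be sketched as follows. Mapping invalid pixels to zero and clamping very small depths are conventions assumed for this example, and `depth_to_inverse_depth` is a hypothetical helper name.

```python
import numpy as np

def depth_to_inverse_depth(depth, min_depth=1e-3):
    """Convert metric depth (meters) to inverse depth (1/meters).

    Non-positive or non-finite depths are treated as invalid and mapped to 0,
    which corresponds to a point at infinity (an assumed convention).
    """
    depth = np.asarray(depth, dtype=np.float64)
    valid = np.isfinite(depth) & (depth > 0)
    inv = np.zeros_like(depth)
    # Clamp tiny depths so the reciprocal stays bounded.
    inv[valid] = 1.0 / np.maximum(depth[valid], min_depth)
    return inv
```

Inverse depth keeps distant points in a bounded range near zero, which is one common reason it is preferred as an optimization variable.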

STEP 05

Bundle Adjustment

Joint refinement of poses, disparities, and intrinsics via vision-language-geometric energy minimization.

STEP 06

OV Grounding

3D points decoded and projected into SigLIP latent space; matched against free-form text query embeddings.
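The matching stage of grounding can be sketched as a cosine-similarity lookup. The example assumes the per-point embeddings have already been decoded and projected into the same latent space as the text embedding; the text encoder itself is out of scope. The function names and the similarity threshold are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def ground_query(point_embeddings, text_embedding, threshold=0.5):
    """Score every 3D point against one text query.

    point_embeddings: (N, D) features in the shared vision-language space.
    text_embedding:   (D,) embedding of the free-form query.
    Returns (cosine similarities (N,), boolean match mask (N,)).
    """
    sims = l2_normalize(point_embeddings) @ l2_normalize(text_embedding)
    # threshold is an illustrative value; a real system would tune it.
    return sims, sims >= threshold
```

The returned mask selects the 3D points to highlight for the query, while the raw similarities can be kept for ranking or soft visualization.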

Fig. 3 — Adaptive robust kernels based on Barron's general loss. Shape parameter α transitions from ℓ₂ (static surfaces) → Huber (movable objects) → Cauchy (actively moving agents), governed by the temporal stability field S(u).
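The kernel family in Fig. 3 can be written down directly from Barron's general robust loss. In the sketch below, `barron_loss` follows the published closed form, in which α=2 gives the ℓ₂ loss, α=1 a smooth pseudo-Huber (Charbonnier) loss, and α=0 the Cauchy loss. The mapping from a stability score to α in `alpha_from_stability` is purely a hypothetical linear interpolation for illustration; the source does not state how S(u) actually sets α.

```python
import numpy as np

def barron_loss(x, alpha, c=1.0):
    """Barron's general robust loss for residual x, shape alpha, scale c."""
    z = (np.asarray(x, dtype=np.float64) / c) ** 2
    if alpha == 2.0:
        return 0.5 * z                      # l2 limit
    if alpha == 0.0:
        return np.log1p(0.5 * z)            # Cauchy limit
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)

def alpha_from_stability(stability):
    """Hypothetical mapping: stability 1 -> alpha 2 (l2), 0 -> alpha 0 (Cauchy)."""
    return 2.0 * float(np.clip(stability, 0.0, 1.0))
```

For a fixed residual, lowering α shrinks the penalty, so pixels judged unstable contribute less to the optimization than pixels on static surfaces.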

Experiments

Quantitative Results

Method	Walking Sequences				Sitting Sequences				Avg ↓
Method	fr3/w/xyz	fr3/w/rpy	fr3/w/hs	fr3/w/static	fr3/s/xyz	fr3/s/rpy	fr3/s/hs	fr3/s/static	Avg ↓
Dyna-SLAM	1.64	3.54	2.96	0.68	1.27	—	1.86	—	2.00
DLD-SLAM	1.85	4.24	2.19	0.56	—	—	—	—	2.21
V3D-SLAM	1.53	7.81	2.29	0.65	0.87	1.69	1.47	0.58	2.10
DGS-SLAM	4.10	—	5.50	0.60	—	—	4.40	—	3.65
RoDyn-SLAM	8.30	—	5.60	1.70	—	—	2.70	—	4.58
DynaMON	1.4	3.9	2.0	1.4	0.9	2.1	1.9	0.5	1.76
ViPE (SAM)	2.35	6.27	10.83	0.51	5.41	3.30	3.53	0.53	4.10
RADIO-ViPE	1.90	3.50	3.10	0.55	1.15	2.72	1.60	0.53	1.90
RADIO-ViPE_ark	1.55	3.39	1.96	0.50	0.98	2.65	1.44	0.56	1.63

Method	mIoU % (w/o bg)	f-mIoU % (w/o bg)	Acc % (w/o bg)	mIoU % (w/ bg)	f-mIoU % (w/ bg)	Acc % (w/ bg)	Online	Calib-Free	Depth-Free	Pose-Free
ConceptFusion	21.07	31.51	35.65	20.38	35.75	41.58	✕	✕	✕	✕
ConceptGraphs	11.63	16.61	19.80	11.72	21.35	28.28	✕	✕	✕	✕
HOV-SG	16.93	31.45	34.74	19.29	30.64	35.17	✕	✕	✕	✕
NACLIP-3D	20.37	35.08	47.47	15.30	16.98	26.23	✕	✕	✕	✕
Trident-3D	21.30	43.34	54.79	20.63	38.53	50.31	✕	✕	✕	✕
RayFronts	39.37	62.03	68.80	27.73	43.37	54.45	✓	✕	✕	✕
RADIO-ViPE_ark	24.25	50.63	59.25	19.00	37.13	48.38	✓	✓	✓	✓

Method	Online Capability			System Properties
Method	Online	Semantics	Grounding	Odometry	Mapping	Dynamic	Calib-Free
ORB-SLAM3	✓	✕	✕	✓	✓	✕	✕
RVWO	✓	✕	✕	✓	✓	✓	✕
Kimera	✓	Closed	✕	✓	✓	✕	✕
SamSLAM	✓	Agnostic	✕	✓	✓	✓	✕
RGBDSLAM	✓	Closed	✕	✓	✓	✕	✕
BBQ	✕	Open	✓	✕	✓	✕	✕
ConceptGraphs	✕	Open	✓	✕	✓	✕	✕
HOV-SG	✕	Open	✓	✕	✓	✕	✕
OpenScene	✕	Open	✓	✕	✕	✕	✕
OpenMask3D	✕	Open	✓	✕	✕	✕	✕
CLIO	✓	Task-driven	✓	✓	✓	✕	✕
OVO-SLAM	✓	Open	✓	✓	✓	✕	✕
RayFronts	✓	Open	✓	✕	✓	✕	✕
MASt3R-SLAM	✓	✕	✕	✓	✓	✕	✓
VGGT	✓	✕	✕	✓	✓	✕	✓
DUSt3R	✓	✕	✕	✓	✓	✓	✓
ViPE	✓	Predefined	✕	✓	✓	✕	✓
RADIO-ViPE (Ours)	✓	Open	✓	✓	✓	✓	✓
