Online Semantic SLAM

RADIO-ViPE

Online Tightly Coupled Multi-Modal Fusion for
Open-Vocabulary Semantic SLAM in Dynamic Environments

¹ Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University, Saint Petersburg, Russia
* Equal contribution  ·  † Corresponding author
RADIO-ViPE Teaser
RADIO-ViPE — real-time open-vocabulary semantic grounding from uncalibrated monocular RGB, running online at ~8–10 FPS
We present RADIO-ViPE (Reduce All Domains Into One, Video Pose Engine), an online semantic SLAM system for geometry-aware open-vocabulary grounding: it associates arbitrary natural-language queries with localized 3D regions and objects in dynamic environments. Unlike existing approaches that require calibrated, posed RGB-D input, RADIO-ViPE operates directly on raw monocular RGB video streams, with no prior camera intrinsics, depth sensors, or pose initialization. The system tightly couples multi-modal embeddings spanning vision and language, derived from agglomerative foundation models (e.g., RADIO), with geometric scene information. This vision-language-geometric fusion is optimized under adaptive robust kernels designed to handle both actively moving objects and agent-displaced scene elements (e.g., furniture rearranged during an egocentric session). Experiments show that RADIO-ViPE achieves state-of-the-art accuracy on the dynamic sequences of the TUM RGB-D benchmark while remaining competitive with offline open-vocabulary methods that rely on calibrated data and static-scene assumptions. RADIO-ViPE thus bridges a critical gap in real-world deployment, enabling robust open-vocabulary semantic grounding for autonomous robotics, AR/VR applications, and unconstrained in-the-wild video streams.
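The grounding idea above, matching free-form text queries against per-point embeddings in a shared vision-language space, can be illustrated with a minimal sketch. Function names, the embedding source, and the similarity threshold are all illustrative assumptions, not the actual RADIO-ViPE API.

```python
import numpy as np

def ground_query(point_embeddings: np.ndarray, text_embedding: np.ndarray,
                 threshold: float = 0.25) -> np.ndarray:
    """Return indices of 3D points whose embeddings match a text query.

    point_embeddings: (N, D) per-point features in a shared
    vision-language space; text_embedding: (D,) encoded query.
    Both are L2-normalized so the dot product is cosine similarity.
    """
    pts = point_embeddings / np.linalg.norm(point_embeddings, axis=1, keepdims=True)
    txt = text_embedding / np.linalg.norm(text_embedding)
    sim = pts @ txt  # (N,) cosine similarity of every point to the query
    return np.flatnonzero(sim > threshold)
```

In the full system the text encoder and point decoder share a latent space (SigLIP in Step 06 below); the sketch only shows the final similarity test.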

What RADIO-ViPE Introduces

01
Vision-Language-Geometric Fusion
A novel tightly coupled multi-modal fusion that jointly embeds high-level features from agglomerative foundation models with geometric constraints directly within a dense bundle adjustment framework.
02
Temporally Consistent Adaptive Robust Kernel
A dynamic-aware robust optimization scheme extending adaptive kernel formulations with temporal consistency, reasoning jointly over geometric reprojection error and cross-view semantic embedding discrepancy.
03
Calibration-Free Online Open-Vocab SLAM
An online, ready-to-deploy semantic SLAM system that unifies vision, language, and geometry from uncalibrated monocular RGB — no depth sensors, pose priors, or category supervision required.
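Contribution 02 reasons jointly over geometric reprojection error and cross-view semantic embedding discrepancy under a robust kernel. A hedged sketch of such a combined robustified energy is below; the Cauchy-style kernel, scale constants, and balance weight are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def robust_energy(reproj_err: np.ndarray, sem_dist: np.ndarray,
                  c_geo: float = 1.0, c_sem: float = 0.5,
                  lam: float = 0.1) -> float:
    """Combined robust energy over geometric and semantic residuals.

    reproj_err: (N,) reprojection errors (pixels); sem_dist: (N,)
    cross-view embedding discrepancies (e.g. 1 - cosine similarity).
    Both pass through a Cauchy-style robust kernel so outliers from
    dynamic content are downweighted; `lam` balances the two terms.
    """
    def cauchy(r, c):
        # Cauchy/Lorentzian loss: grows logarithmically for large residuals
        return np.log1p(0.5 * (r / c) ** 2)
    return float(cauchy(reproj_err, c_geo).sum()
                 + lam * cauchy(sem_dist, c_sem).sum())
```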

System Overview

RADIO-ViPE Pipeline Overview
Fig. 2 — RADIO-ViPE pipeline: GeoCalib → RADSeg → DROID-SLAM flow → Dense Tightly Coupled Bundle Adjustment → Open-Vocabulary Grounding
STEP 01
Camera Init
Intrinsics bootstrapped from sampled frames via GeoCalib — no calibration targets required.
STEP 02
Keyframe Selection
Relative motion estimated via weighted dense optical flow; frames exceeding a motion threshold become keyframes.
STEP 03
Feature Extraction
Dense multi-modal embeddings from RADSeg, compressed to K = 256 dimensions via PCA in the encoder feature space.
STEP 04
Depth Estimation
Metric depth per keyframe from foundation models; converted to inverse depth for numerical stability.
STEP 05
Bundle Adjustment
Joint refinement of poses, disparities, and intrinsics via vision-language-geometric energy minimization.
STEP 06
OV Grounding
3D points decoded and projected into the SigLIP latent space, then matched against free-form text-query embeddings.
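The motion gate in Step 02 can be sketched as follows. The per-frame flow maps, the confidence weighting, and the pixel threshold are illustrative assumptions for the sketch, not the system's actual parameters.

```python
import numpy as np

def select_keyframes(flow_mags, weights, threshold: float = 1.0):
    """Pick keyframes by weighted mean optical-flow magnitude.

    flow_mags: sequence of (H, W) per-pixel flow magnitudes, each
    assumed precomputed relative to the most recent keyframe;
    weights: (H, W) confidence map for the weighted average.
    A frame becomes a keyframe when its weighted mean flow exceeds
    `threshold` pixels.
    """
    keyframes = [0]  # the first frame always seeds the map
    for i, mag in enumerate(flow_mags[1:], start=1):
        score = float((mag * weights).sum() / weights.sum())
        if score > threshold:
            keyframes.append(i)
    return keyframes
```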
Adaptive Robust Kernels
Fig. 3 — Adaptive robust kernels based on Barron's general loss. Shape parameter α transitions from ℓ₂ (static surfaces) → Huber (movable objects) → Cauchy (actively moving agents), governed by the temporal stability field S(u).
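The kernel family in Fig. 3 is Barron's general robust loss, whose shape parameter α interpolates between the named special cases (α = 2 quadratic, α = 0 Cauchy, α → −∞ Welsch). A minimal sketch of the loss itself follows; how S(u) schedules α per pixel is the paper's contribution and is not reproduced here.

```python
import numpy as np

def barron_loss(x: float, alpha: float, c: float = 1.0) -> float:
    """Barron's general robust loss rho(x; alpha, c).

    alpha = 2     -> quadratic (L2) behaviour,
    alpha = 1     -> smooth pseudo-Huber-like loss,
    alpha = 0     -> Cauchy/Lorentzian,
    alpha -> -inf -> Welsch.
    Smaller alpha downweights large residuals (outliers) harder.
    """
    z = (x / c) ** 2
    if alpha == 2.0:
        return 0.5 * z
    if alpha == 0.0:
        return float(np.log1p(0.5 * z))
    if np.isneginf(alpha):
        return float(1.0 - np.exp(-0.5 * z))
    b = abs(alpha - 2.0)  # keeps the general form defined for alpha < 2
    return float((b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0))
```

For a fixed residual, decreasing α monotonically shrinks the loss, which is exactly the static → movable → moving downweighting the caption describes.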

Quantitative Results

Table 1 — Absolute trajectory error on TUM RGB-D walking and sitting sequences (lower is better; "–" = not reported).

| Method | fr3/w/xyz | fr3/w/rpy | fr3/w/hs | fr3/w/static | fr3/s/xyz | fr3/s/rpy | fr3/s/hs | fr3/s/static | Avg ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Dyna-SLAM | 1.64 | 3.54 | 2.96 | 0.68 | 1.27 | 1.86 | – | – | 2.00 |
| DLD-SLAM | 1.85 | 4.24 | 2.19 | 0.56 | – | – | – | – | 2.21 |
| V3D-SLAM | 1.53 | 7.81 | 2.29 | 0.65 | 0.87 | 1.69 | 1.47 | 0.58 | 2.10 |
| DGS-SLAM | 4.10 | – | 5.50 | 0.60 | 4.40 | – | – | – | 3.65 |
| RoDyn-SLAM | 8.30 | – | 5.60 | 1.70 | 2.70 | – | – | – | 4.58 |
| DynaMON | 1.4 | 3.9 | 2.0 | 1.4 | 0.9 | 2.1 | 1.9 | 0.5 | 1.76 |
| ViPE (SAM) | 2.35 | 6.27 | 10.83 | 0.51 | 5.41 | 3.30 | 3.53 | 0.53 | 4.10 |
| RADIO-ViPE | 1.90 | 3.50 | 3.10 | 0.55 | 1.15 | 2.72 | 1.60 | 0.53 | 1.90 |
| RADIO-ViPE-ARK | 1.55 | 3.39 | 1.96 | 0.50 | 0.98 | 2.65 | 1.44 | 0.56 | 1.63 |
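The trajectory metric in Table 1 is the standard absolute trajectory error (ATE), typically reported as an RMSE after aligning the estimated trajectory to ground truth. A minimal sketch is below; it assumes the trajectories are already time-associated and aligned (real evaluations first apply, e.g., Umeyama similarity alignment).

```python
import numpy as np

def ate_rmse(est: np.ndarray, gt: np.ndarray) -> float:
    """RMSE of absolute trajectory error between aligned trajectories.

    est, gt: (N, 3) camera positions at matching timestamps,
    assumed pre-aligned for this sketch.
    """
    err = np.linalg.norm(est - gt, axis=1)  # per-pose translational error
    return float(np.sqrt(np.mean(err ** 2)))
```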
Table 2 — Open-vocabulary 3D semantic segmentation (%), evaluated without and with background classes, alongside system properties.

| Method | mIoU (w/o bg) | f-mIoU (w/o bg) | Acc (w/o bg) | mIoU (w/ bg) | f-mIoU (w/ bg) | Acc (w/ bg) | Online | Calib-Free | Depth-Free | Pose-Free |
|---|---|---|---|---|---|---|---|---|---|---|
| ConceptFusion | 21.07 | 31.51 | 35.65 | 20.38 | 35.75 | 41.58 | | | | |
| ConceptGraphs | 11.63 | 16.61 | 19.80 | 11.72 | 21.35 | 28.28 | | | | |
| HOV-SG | 16.93 | 31.45 | 34.74 | 19.29 | 30.64 | 35.17 | | | | |
| NACLIP-3D | 20.37 | 35.08 | 47.47 | 15.30 | 16.98 | 26.23 | | | | |
| Trident-3D | 21.30 | 43.34 | 54.79 | 20.63 | 38.53 | 50.31 | | | | |
| RayFronts | 39.37 | 62.03 | 68.80 | 27.73 | 43.37 | 54.45 | | | | |
| RADIO-ViPE-ARK | 24.25 | 50.63 | 59.25 | 19.00 | 37.13 | 48.38 | | | | |
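Table 2's segmentation metrics can be computed with a short sketch: mIoU averages per-class intersection-over-union, while f-mIoU weights each class IoU by its ground-truth pixel frequency. The implementation below is a generic illustration, not the paper's evaluation code.

```python
import numpy as np

def miou(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    """Return (mIoU, f-mIoU) over semantic label maps.

    pred, gt: integer label arrays of identical shape.  Classes absent
    from both prediction and ground truth are skipped; f-mIoU weights
    each class IoU by its pixel count in the ground truth.
    """
    ious, freqs = [], []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class appears in neither map
        ious.append(np.logical_and(p, g).sum() / union)
        freqs.append(g.sum())
    ious, freqs = np.array(ious), np.array(freqs)
    return float(ious.mean()), float((ious * freqs).sum() / freqs.sum())
```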
Table 3 — Comparison of online capability and system properties (semantics vocabulary: Closed, Open, Agnostic, Task-driven, or Predefined).

| Method | Online | Semantics | Grounding | Odometry | Mapping | Dynamic | Calib-Free |
|---|---|---|---|---|---|---|---|
| ORB-SLAM3 | | | | | | | |
| RVWO | | | | | | | |
| Kimera | | Closed | | | | | |
| SamSLAM | | Agnostic | | | | | |
| RGBDSLAM | | Closed | | | | | |
| BBQ | | Open | | | | | |
| ConceptGraphs | | Open | | | | | |
| HOV-SG | | Open | | | | | |
| OpenScene | | Open | | | | | |
| OpenMask3D | | Open | | | | | |
| CLIO | | Task-driven | | | | | |
| OVO-SLAM | | Open | | | | | |
| RayFronts | | Open | | | | | |
| MASt3R-SLAM | | | | | | | |
| VGGT | | | | | | | |
| DUSt3R | | | | | | | |
| ViPE | | Predefined | | | | | |
| RADIO-ViPE (Ours) | | Open | | | | | |

BibTeX

radio_vipe.bib
@misc{nasser2026radiovipe,
  title     = {RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion
               for Open-Vocabulary Semantic SLAM in Dynamic Environments},
  author    = {Zaid Nasser and Mikhail Iumanov and Tianhao Li and
               Maxim Popov and Jaafar Mahmoud and Sergey Kolyubin},
  year      = {2026},
  institution = {BE2R Lab, ITMO University},
  note      = {Project page: https://github.com/be2rlab/RADIO-ViPE}
}