ICRA 2026 Workshop on Semantics for Reliable Robot Autonomy

R5DGS: Semantic-Aware 4D Gaussian Splatting
with Rigid Body Constraints for Efficient Dynamic Scene Reconstruction

Denis Gridusov, Maxim Popov, Sergey Kolyubin
BE2R Lab, ITMO University

R5DGS reconstructions of four Dynamic Indoor Scene environments: Dining Table, Chessboard, Darkroom, and Factory. Each column shows a different scene, with the R5DGS-rendered RGB image (top) and semantic segmentation (bottom).

Abstract

We present R5DGS, a framework that extends physics-informed 4D Gaussian Splatting with instance-level semantics and rigid-body-constrained extrapolation for efficient dynamic scene reconstruction. Our method augments each Gaussian with a compact learnable identity vector, enabling discrete object grouping without 3D annotations.

By restricting dynamics prediction to representative Gaussians per object (typically 9-12) rather than the full set (~40,000), we gain roughly 11 FPS during extrapolation on an NVIDIA RTX 4090 while preserving physically plausible motion trajectories.

Additionally, we construct an offline CLIP-based lookup table that enables open-vocabulary object retrieval from natural language prompts, supporting selective rendering and scene editing without retraining.

Key Results

FPS Speedup: +11 FPS
Inference Complexity: O(N) → O(K)
Representatives / Scene: 9-12
mIoU (Overall): 0.59
Open-Vocab. Retrieval: CLIP-based

Method

R5DGS builds on TRACE and adds three components: identity-augmented Gaussians, rigid-body constrained extrapolation, and a CLIP-based open-vocabulary retrieval pipeline.

R5DGS pipeline overview. Multi-view RGB videos are represented as identity-augmented 4D Gaussians, with identity encoding guided by SAM2+DEVA semantic masks. Rigid-body extrapolation propagates dynamics from representative Gaussians (9-12 per object) to the full set, reducing inference cost from O(N) to O(K).

C1. Identity-Augmented Gaussians

Each 3D Gaussian is augmented with a learnable 16-dimensional identity vector that is alpha-blended during rendering. A lightweight classifier supervised by SAM2+DEVA masks enables discrete object grouping without any 3D annotations.

semantic grouping
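
As a rough illustration of the identity encoding described above, the sketch below alpha-blends per-Gaussian 16-dimensional identity vectors along a single pixel ray using standard front-to-back compositing, then passes the blended feature through a classifier. The function names (`blend_identity`, `classify`), the random weights, and the choice of a single linear layer with softmax are assumptions made for illustration; the page only states that the classifier is lightweight and supervised by SAM2+DEVA masks.

```python
import numpy as np

def blend_identity(identities, alphas):
    """Front-to-back alpha compositing of identity vectors along one ray.

    identities: (M, D) identity vectors of the M Gaussians hit by the ray,
                sorted from nearest to farthest.
    alphas:     (M,) per-Gaussian opacity contributions in [0, 1].
    """
    # Transmittance in front of each Gaussian: product of (1 - alpha)
    # over all Gaussians closer to the camera.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas[:-1])))
    weights = alphas * transmittance
    return weights @ identities  # (D,)

def classify(feature, W, b):
    """Hypothetical linear classifier: softmax over object-instance logits."""
    logits = feature @ W + b
    logits = logits - logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
D, num_objects = 16, 5                      # 16-D identity vector, 5 assumed objects
identities = rng.normal(size=(3, D))        # three Gaussians along the ray
alphas = np.array([0.6, 0.5, 0.9])

pixel_feature = blend_identity(identities, alphas)
probs = classify(pixel_feature, rng.normal(size=(D, num_objects)), np.zeros(num_objects))
predicted_object = int(np.argmax(probs))    # discrete object label for this pixel
```

In training, the predicted distribution would be compared against the 2D mask label for that pixel, which is how 2D supervision can shape per-Gaussian identities without 3D annotations.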

C2. Rigid-Body Extrapolation

At inference, dynamics are computed only for representative Gaussians at each object's geometric centroid. Motion is rigidly propagated to all other Gaussians via translation and rotation of precomputed canonical offsets, preserving inter-point distances.

~11 FPS speedup
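
The rigid propagation step can be sketched as follows, assuming each object is described by one rotation matrix and one translated centroid per time step. The helper names (`canonical_offsets`, `propagate_rigid`, `rotation_z`) and the example motion are hypothetical, and how the rotation is obtained from the representative Gaussians is not specified on this page. Only Gaussian centers are moved here; rotating each Gaussian's own orientation is omitted.

```python
import numpy as np

def canonical_offsets(points, centroid):
    """Precompute each Gaussian's offset from the object centroid (canonical frame)."""
    return points - centroid

def propagate_rigid(offsets, new_centroid, rotation):
    """Apply x_i(t) = c(t) + R(t) @ o_i to every offset o_i.

    Points are stored as rows, so R @ o_i becomes offsets @ R.T.
    """
    return new_centroid + offsets @ rotation.T

def rotation_z(angle):
    """Rotation matrix about the z-axis (used only to build an example motion)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 3))          # Gaussian centers of one object
centroid = points.mean(axis=0)
offsets = canonical_offsets(points, centroid)

# Assumed output of the dynamics model for the representative at time t.
R = rotation_z(np.pi / 6)
new_centroid = centroid + np.array([0.1, 0.0, 0.2])

moved = propagate_rigid(offsets, new_centroid, R)

# A rigid transform preserves inter-point distances.
d_before = np.linalg.norm(points[0] - points[1])
d_after = np.linalg.norm(moved[0] - moved[1])
```

The dynamics model is evaluated only for the representatives, while the remaining Gaussians are updated with a single matrix product per object, which is where the O(N) to O(K) reduction in dynamics queries comes from.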

C3. Open-Vocabulary Querying

An offline CLIP-based lookup table stores text-aligned embeddings for each object group. Natural language prompts retrieve Gaussian subsets via cosine similarity, enabling selective rendering and scene editing without retraining.

zero-shot retrieval
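
A minimal sketch of the lookup step, assuming the table is an array with one text-aligned embedding per object group and that the prompt has already been encoded by a CLIP text encoder. The 4-dimensional stand-in vectors, the `threshold` parameter, and the function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cosine_similarity(query, table):
    """Cosine similarity between one query vector and each row of the table."""
    query = query / np.linalg.norm(query)
    table = table / np.linalg.norm(table, axis=1, keepdims=True)
    return table @ query

def retrieve_gaussians(query_embedding, table_embeddings, gaussian_object_ids, threshold=0.5):
    """Return a boolean mask over Gaussians belonging to the best-matching object.

    threshold is an assumed cutoff: if no object is similar enough,
    the mask is empty.
    """
    sims = cosine_similarity(query_embedding, table_embeddings)
    best = int(np.argmax(sims))
    if sims[best] < threshold:
        return np.zeros(gaussian_object_ids.shape, dtype=bool), best
    return gaussian_object_ids == best, best

# Stand-in lookup table: one embedding per object group (real CLIP vectors are much larger).
table_embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
# Stand-in for the encoded text prompt.
query_embedding = np.array([0.1, 0.9, 0.0, 0.1])
# Object group assigned to each Gaussian by the identity classifier.
gaussian_object_ids = np.array([0, 1, 1, 2, 1, 0])

mask, best = retrieve_gaussians(query_embedding, table_embeddings, gaussian_object_ids)
```

The resulting mask selects the Gaussians to render or edit. Because the table is built offline, a query costs one text encoding plus a matrix-vector product.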

Results

Quantitative and qualitative evaluation on the Dynamic Indoor Scene dataset.

Qualitative Results

Open-vocabulary object retrieval and semantic segmentation with R5DGS
R5DGS enables open-vocabulary object retrieval from natural language prompts via CLIP-based embedding lookup, supporting selective rendering and scene editing without retraining.

Reconstruction Quality & Speed

Novel View Synthesis across 4 scenes (extrapolation)

Method                      Dining Table            Chessboard              Darkroom                Factory
                            PSNR↑   SSIM↑  LPIPS↓   PSNR↑   SSIM↑  LPIPS↓   PSNR↑   SSIM↑  LPIPS↓   PSNR↑   SSIM↑  LPIPS↓
TRACE                       35.580  0.962  0.050    34.630  0.963  0.055    37.774  0.961  0.067    36.488  0.965  0.049
5DGS (Ours)                 35.428  0.956  0.055    33.991  0.956  0.063    36.600  0.955  0.074    35.926  0.958  0.055
R5DGS (Ours)                28.844  0.942  0.066    28.805  0.932  0.086    31.181  0.939  0.091    29.798  0.924  0.075
R5DGS w/ extra loss (Ours)  28.688  0.939  0.067    29.153  0.929  0.089    31.537  0.943  0.087    30.749  0.928  0.073

Segmentation Accuracy & Frame Rate per scene

Method  Dining Table  Chessboard    Darkroom      Factory       Overall
        FPS↑  mIoU↑   FPS↑  mIoU↑   FPS↑  mIoU↑   FPS↑  mIoU↑   FPS↑  mIoU↑
5DGS    66.9  0.78    67.3  0.75    49.4  0.37    64.9  0.47    62.1  0.59
R5DGS   76.3  0.77    76.9  0.73    66.2  0.38    75.0  0.46    73.6  0.59

Per-scene metrics on the Dynamic Indoor Scene dataset. FPS is measured on an NVIDIA RTX 4090. R5DGS is faster than 5DGS in every scene, gaining between 9.4 FPS (Dining Table) and 16.8 FPS (Darkroom), or 11.5 FPS overall, while overall mIoU stays at 0.59.

BibTeX

@article{gridusov2026r5dgs,
  title={R5DGS: Semantic-Aware 4D Gaussian Splatting with Rigid Body Constraints for Efficient Dynamic Scene Reconstruction},
  author={Gridusov, Denis and Popov, Maxim and Kolyubin, Sergey},
  journal={arXiv preprint arXiv:2605.25909},
  year={2026}
}