ICRA 2026 MM-Spatial & SRRA Workshops

AgentGrounder
Zero-Shot 3D Visual Pointcloud Grounding
using Multimodal Language Models

Cuong Huynh, Maxim Popov, Denis Gridusov, Sergey Kolyubin

Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University, Russia

Zero-shot 3D visual grounding on colored point clouds, with no task-specific 3D training required.

Abstract

3D Visual Grounding (3DVG) is an essential capability for embodied AI: agents must localize objects in 3D scenes from natural language descriptions. Recent zero-shot methods leverage large 2D vision-language models (LVLMs). However, they often rely on pre-existing sets of multi-view images and struggle with the limited semantic and spatial detail provided by standard 3D segmentation tools.

We present AgentGrounder, a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training. Our approach follows a two-stage design: (1) an offline stage that applies a 3D model to build an Object Lookup Table (OLT) with instance IDs, semantic labels, and 3D bounding boxes; and (2) an online tool-driven agent that decomposes each query, retrieves only relevant candidates from the OLT, performs geometric scoring, and triggers image rendering on demand when additional visual evidence (e.g., color, material, or viewpoint-sensitive cues) is required.

Compared with fixed anchor-target matching pipelines, this design reduces cascading matching errors and improves context-window efficiency by keeping irrelevant objects out of the prompt. We evaluate on ScanRefer and Nr3D in a zero-shot setting and observe consistent improvements over SeeGround in our setup: +2.5 points in overall Acc@0.5 on ScanRefer and +6.3 points in overall accuracy on Nr3D, including a +6.3-point gain on Nr3D view-independent queries. These results show that combining selective retrieval, geometric reasoning, and adaptive visual inspection yields a practical and robust foundation for open-vocabulary 3D grounding.

Method Overview

AgentGrounder follows a two-stage design that separates offline scene understanding from online query-driven reasoning.

Stage 1

Offline

A 3D instance segmentation network (Mask3D) processes the colored point cloud once per scene. Each detected instance is stored in an Object Lookup Table (OLT) with its unique ID, semantic label, 3D center coordinates, and bounding-box dimensions.

  • Instance-level masks & semantics
  • 3D bounding boxes per object
  • Executed once per scene
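
The offline stage can be pictured with a short sketch. This is a minimal illustration, not the authors' implementation: it assumes the segmentation network has already produced a per-point instance ID array and a label per instance, and the names OLTEntry and build_olt are hypothetical. Boxes are taken as axis-aligned, which is also an assumption.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class OLTEntry:
    """One row of a hypothetical Object Lookup Table."""
    instance_id: int
    label: str
    center: tuple  # bounding-box center (x, y, z)
    size: tuple    # bounding-box extents (dx, dy, dz)


def build_olt(points, instance_ids, labels):
    """Build the table from per-point instance assignments.

    points: (N, 3) array of coordinates.
    instance_ids: (N,) integer array, with -1 meaning "no instance" (assumed convention).
    labels: dict mapping instance ID to a semantic label.
    """
    table = []
    for inst in sorted(set(instance_ids.tolist()) - {-1}):
        pts = points[instance_ids == inst]
        lo, hi = pts.min(axis=0), pts.max(axis=0)
        center = tuple(((lo + hi) / 2).tolist())
        size = tuple((hi - lo).tolist())
        table.append(OLTEntry(inst, labels[inst], center, size))
    return table


# Toy scene: two instances plus one unassigned point.
points = np.array(
    [[0, 0, 0], [1, 2, 1], [4, 4, 0], [5, 6, 2], [9, 9, 9]], dtype=float
)
instance_ids = np.array([0, 0, 1, 1, -1])
olt = build_olt(points, instance_ids, {0: "chair", 1: "table"})
for entry in olt:
    print(entry)
```

Because the table is built once per scene, every later query can reuse it without touching the raw point cloud again.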

Stage 2

Online

A vision-language model (Qwen3-VL-32B) acts as an agent that decomposes the language query, retrieves only the relevant candidates from the OLT, and applies deterministic geometric scoring to rank them.

  • Query decomposition & planning
  • Label-based candidate retrieval
  • Geometric scoring (distance, direction, size)
  • On-demand rendering for visual disambiguation
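
Label-based retrieval followed by deterministic distance scoring might look like the sketch below. It is an illustration under stated assumptions: the table rows, the helper names retrieve and rank_by_distance, and the choice of center-to-center Euclidean distance are all hypothetical, since the page does not specify the exact scoring function.

```python
import math

# Hypothetical lookup-table rows; the field names are assumptions.
OLT = [
    {"id": 0, "label": "table", "center": (2.0, 2.0, 0.4), "size": (1.2, 0.8, 0.8)},
    {"id": 1, "label": "chair", "center": (2.5, 2.2, 0.45), "size": (0.5, 0.5, 0.9)},
    {"id": 2, "label": "chair", "center": (5.0, 1.0, 0.45), "size": (0.5, 0.5, 0.9)},
    {"id": 3, "label": "lamp", "center": (0.2, 4.0, 1.0), "size": (0.3, 0.3, 1.6)},
]


def retrieve(olt, label):
    """Return only the rows whose semantic label matches the query term."""
    return [obj for obj in olt if obj["label"] == label]


def rank_by_distance(targets, anchors, farthest=False):
    """Sort targets by center distance to their closest anchor."""
    def score(target):
        return min(math.dist(target["center"], a["center"]) for a in anchors)
    return sorted(targets, key=score, reverse=farthest)


# Query: "the chair closest to the table"
chairs = retrieve(OLT, "chair")
tables = retrieve(OLT, "table")
nearest_id = rank_by_distance(chairs, tables)[0]["id"]
farthest_id = rank_by_distance(chairs, tables, farthest=True)[0]["id"]
print(nearest_id, farthest_id)
```

Only the chairs and the table enter the computation, so the lamp never needs to appear in the model's prompt.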

AgentGrounder pipeline: offline 3D segmentation builds an Object Lookup Table; online agent retrieves candidates, scores them geometrically, and optionally renders images for view-dependent disambiguation.
Overview of the AgentGrounder agent pipeline. For each query, the agent first writes an explicit plan, retrieves candidate objects by semantic labels from scene metadata, and computes geometric relations (e.g., nearest/farthest, left/right, below) from 3D centers and box sizes. For view-dependent expressions, the agent calls a rendering tool to inspect selected object IDs and resolve ambiguity. Finally, it returns a structured answer containing the predicted object ID and a textual justification.
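
As one example of a relation computed from 3D centers and box sizes, the sketch below checks "below" and wraps the result in a structured answer. Everything here is an assumption made for illustration: the z-up axis convention, the tolerance value, the footprint-overlap test, the helper names, and the answer field names are not taken from the paper.

```python
import json


def footprints_overlap(a, b):
    """True if the x-y footprints of two axis-aligned boxes intersect."""
    return all(
        abs(a["center"][i] - b["center"][i]) <= (a["size"][i] + b["size"][i]) / 2
        for i in (0, 1)
    )


def is_below(a, b, tol=0.05):
    """True if box a lies under box b (z is assumed to point up)."""
    top_of_a = a["center"][2] + a["size"][2] / 2
    bottom_of_b = b["center"][2] - b["size"][2] / 2
    return footprints_overlap(a, b) and top_of_a <= bottom_of_b + tol


# Query: "the trash can below the desk" (hypothetical scene objects)
desk = {"id": 7, "label": "desk", "center": (1.0, 1.0, 0.75), "size": (1.4, 0.7, 0.1)}
bins = [
    {"id": 11, "label": "trash can", "center": (1.2, 1.1, 0.2), "size": (0.3, 0.3, 0.4)},
    {"id": 12, "label": "trash can", "center": (3.0, 0.5, 0.2), "size": (0.3, 0.3, 0.4)},
]

matches = [b for b in bins if is_below(b, desk)]
answer = {
    "object_id": matches[0]["id"],
    "justification": "Only this trash can sits under the desk footprint.",
}
print(json.dumps(answer))
```

A relation like this needs no image at all, which is why rendering can be reserved for cues that geometry cannot settle.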

Results

AgentGrounder consistently outperforms SeeGround across both benchmarks in our zero-shot setting.

ScanRefer (validation)

Method                Venue       Zero-Shot Agent     Unique              Multiple            Overall
                                                      Acc@0.25  Acc@0.5   Acc@0.25  Acc@0.5   Acc@0.25  Acc@0.5
Fully supervised
G3-LQ                 CVPR'24     -                   88.6      73.3      50.2      39.7      56.0      44.7
MCLN                  ECCV'24     -                   86.9      72.7      52.0      40.8      57.2      45.7
ConcreteNet           ECCV'24     -                   86.4      82.1      42.4      38.4      50.6      46.5
Chat-Scene            NeurIPS'24  Vicuna-7B           89.6      82.5      47.8      42.9      55.5      50.2
Video-3D LLM          CVPR'25     LLaVA-Video 7B      88.0      78.3      50.9      45.3      58.1      51.7
GPT4Scene             ICLR'26     Qwen2-VL-7B         90.3      83.7      56.4      50.9      62.6      57.0
Zero-shot
LLM-G                 ICRA'24     GPT-3.5             -         -         -         -         14.3      4.7
LLM-G                 ICRA'24     GPT-4 turbo         -         -         -         -         17.1      5.3
ZSVG3D                CVPR'24     GPT-4 turbo         63.8      58.4      27.7      24.6      36.4      32.7
SeeGround             CVPR'25     Qwen2-VL-72B        75.7      68.9      34.0      30.0      44.1      39.4
CSVG                  BMVC'25     Mistral-Large-240B  68.8      61.1      38.4      27.3      49.6      39.8
AgentGrounder (Ours)  -           Qwen3-VL-32B        80.3      73.7      36.6      31.7      47.2      41.9
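
For readers unfamiliar with the Acc@0.25 and Acc@0.5 columns, the sketch below computes the usual quantity: the fraction of predictions whose 3D IoU with the ground-truth box reaches the threshold. It assumes axis-aligned boxes given as center and size, and it counts an IoU exactly equal to the threshold as a hit; both are assumptions, and the function names are hypothetical rather than taken from any benchmark toolkit.

```python
def box_iou_3d(center_a, size_a, center_b, size_b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for c_a, s_a, c_b, s_b in zip(center_a, size_a, center_b, size_b):
        lo = max(c_a - s_a / 2, c_b - s_b / 2)
        hi = min(c_a + s_a / 2, c_b + s_b / 2)
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = size_a[0] * size_a[1] * size_a[2]
    vol_b = size_b[0] * size_b[1] * size_b[2]
    return inter / (vol_a + vol_b - inter)


def accuracy_at_iou(ious, threshold):
    """Fraction of predictions with IoU at or above the threshold."""
    return sum(iou >= threshold for iou in ious) / len(ious)


gt = ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
predictions = [
    ((1.0, 0.0, 0.0), (2.0, 2.0, 2.0)),  # shifted by half a box width
    ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0)),  # exact match
    ((5.0, 5.0, 5.0), (2.0, 2.0, 2.0)),  # disjoint
]
ious = [box_iou_3d(c, s, *gt) for c, s in predictions]
print(ious)
print(accuracy_at_iou(ious, 0.25), accuracy_at_iou(ious, 0.5))
```

The shifted box passes the 0.25 threshold but fails at 0.5, which is the kind of case that separates the two columns.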

Nr3D (validation)

Method                Easy         Hard         View Dep.    View Indep.  Overall
Fully supervised
MiKASA                69.7         59.4         65.4         64.0         64.4
ViL3DRel              70.2         57.4         62.0         64.5         64.4
SceneVerse            72.5         57.8         56.9         67.9         64.9
Zero-shot
ZSVG3D                46.5         31.7         36.8         40.0         39.0
VLM-Grounder          55.2         39.5         45.8         49.4         48.0
SeeGround             54.5         38.3         42.3         48.2         46.1
AgentGrounder (Ours)  59.6         45.4         47.9         54.5         52.4

Ablation Study

Ablation of agent tools on a held-out Nr3D subset (N = 120), measured by Acc@0.5.

#   Retrieval  Distance  Planning  Rendering  Easy   Hard   Dep.   Indep.  Overall
1                                             51.7   29.0   35.7   42.3    40.0
2                                             50.0   40.3   35.7   50.0    45.0
3                                             58.6   35.5   38.1   51.3    46.7
4                                             62.1   33.9   50.0   46.2    47.5
5                                             65.5   35.5   47.6   51.3    50.0

Qualitative comparison between SeeGround and AgentGrounder on example 3D scenes.
Qualitative comparison. Example predictions from SeeGround (left) and AgentGrounder (right) on ScanRefer and Nr3D scenes. Our method produces tighter bounding boxes and fewer false positives in cluttered scenes.

BibTeX

@article{huynh2026agentgrounder,
  title={AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding using Multimodal Language Models},
  author={Huynh, Cuong and Popov, Maxim and Gridusov, Denis and Kolyubin, Sergey},
  journal={arXiv preprint arXiv:2605.25901},
  year={2026}
}