ICRA 2026 MM-Spatial & SRRA Workshops

AgentGrounder
Zero-Shot 3D Visual Pointcloud Grounding
using Multimodal Language Models

Cuong Huynh, Maxim Popov, Denis Gridusov, Sergey Kolyubin

Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University, Russia

Zero-shot 3D visual grounding on colored point clouds, with no task-specific 3D training required.

Abstract

3D Visual Grounding (3DVG) is an essential capability for embodied AI, requiring agents to localize objects in 3D scenes based on natural language descriptions. Recent zero-shot methods leverage large 2D vision-language models (LVLMs). However, they often depend on pre-existing sets of multi-view images and are limited by the coarse semantic and spatial detail that standard 3D segmentation tools provide.

We present AgentGrounder, a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training. Our approach follows a two-stage design: (1) an offline stage that applies a 3D model to build an Object Lookup Table (OLT) with instance IDs, semantic labels, and 3D bounding boxes; and (2) an online tool-driven agent that decomposes each query, retrieves only relevant candidates from the OLT, performs geometric scoring, and triggers image rendering on demand when additional visual evidence (e.g., color, material, or viewpoint-sensitive cues) is required.

Compared with fixed anchor-target matching pipelines, this design reduces cascading matching errors and improves context-window efficiency by avoiding prompts overloaded with irrelevant objects. We evaluate on ScanRefer and Nr3D under a zero-shot setting and observe consistent improvements over SeeGround in our setup, including +2.5 points in overall Acc@0.5 on ScanRefer and +6.3 points overall on Nr3D, with a matching +6.3-point gain on Nr3D view-independent queries. These results show that combining selective retrieval, geometric reasoning, and adaptive visual inspection yields a practical and robust foundation for open-vocabulary 3D grounding.

Method Overview

AgentGrounder follows a two-stage design that separates offline scene understanding from online query-driven reasoning.

Stage 1: Offline

A 3D instance segmentation network (Mask3D) processes the colored point cloud once per scene. Each detected instance is stored in an Object Lookup Table (OLT) with its unique ID, semantic label, 3D center coordinates, and bounding-box dimensions.

Instance-level masks & semantics
3D bounding boxes per object
Executed once per scene
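
As a rough illustration of the table described above, the sketch below stores one record per detected instance. The names `OLTEntry` and `build_olt`, the field names, and the assumption that the segmenter's output is available as (label, points) pairs are hypothetical and not taken from the authors' code; Mask3D itself is not invoked here, and boxes are assumed to be axis-aligned.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class OLTEntry:
    """One row of a hypothetical Object Lookup Table."""
    instance_id: int
    label: str
    center: Vec3  # axis-aligned box center (x, y, z)
    size: Vec3    # axis-aligned box extents (dx, dy, dz)


def build_olt(instances: Sequence[Tuple[str, Sequence[Vec3]]]) -> Dict[int, OLTEntry]:
    """Build the table from (label, points) pairs.

    Assumption: each instance arrives as its semantic label plus the xyz
    coordinates of its points, e.g. from an instance segmentation mask.
    """
    olt: Dict[int, OLTEntry] = {}
    for instance_id, (label, points) in enumerate(instances):
        if not points:
            continue  # skip empty masks
        mins = tuple(min(p[i] for p in points) for i in range(3))
        maxs = tuple(max(p[i] for p in points) for i in range(3))
        center = tuple((lo + hi) / 2.0 for lo, hi in zip(mins, maxs))
        size = tuple(hi - lo for lo, hi in zip(mins, maxs))
        olt[instance_id] = OLTEntry(instance_id, label, center, size)
    return olt
```

Because the table is built once per scene, every later query can read labels, centers, and sizes without touching the raw point cloud again.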

Stage 2: Online

A vision-language model (Qwen3-VL-32B) acts as an agent that decomposes the language query, retrieves only the relevant candidates from the OLT, and applies deterministic geometric scoring to rank them.

Query decomposition & planning
Label-based candidate retrieval
Geometric scoring (distance, direction, size)
On-demand rendering for visual disambiguation
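
The retrieval and scoring steps listed above can be sketched for a query such as "the chair closest to the table". Everything here is illustrative: the toy `OLT` contents, the helper names `retrieve` and `rank_by_distance`, and the choice of center-to-center Euclidean distance as the score are assumptions, not the authors' exact scoring functions.

```python
import math
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]

# Hypothetical OLT: instance ID -> (label, center). Values are made up.
OLT: Dict[int, Tuple[str, Vec3]] = {
    0: ("chair", (0.0, 0.0, 0.5)),
    1: ("chair", (3.0, 0.0, 0.5)),
    2: ("table", (2.5, 0.5, 0.4)),
    3: ("lamp", (5.0, 5.0, 1.2)),
}


def retrieve(olt: Dict[int, Tuple[str, Vec3]], label: str) -> List[int]:
    """Return IDs whose semantic label matches, so that unrelated
    objects never reach the agent's prompt."""
    return [i for i, (lab, _) in olt.items() if lab == label]


def rank_by_distance(olt: Dict[int, Tuple[str, Vec3]], targets: List[int],
                     anchor: int, nearest: bool = True) -> List[int]:
    """Order target candidates by center-to-center distance to the anchor."""
    anchor_center = olt[anchor][1]
    return sorted(
        targets,
        key=lambda i: math.dist(olt[i][1], anchor_center),
        reverse=not nearest,
    )


chairs = retrieve(OLT, "chair")                     # target candidates
tables = retrieve(OLT, "table")                     # anchor candidates
ranking = rank_by_distance(OLT, chairs, tables[0])  # best candidate first
```

Since the ranking is computed deterministically from stored coordinates, the language model only has to choose which relation to evaluate rather than estimate distances itself.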

**Overview of the AgentGrounder agent pipeline.** For each query, the agent first writes an explicit plan, retrieves candidate objects by semantic labels from scene metadata, and computes geometric relations (e.g., nearest/farthest, left/right, below) from 3D centers and box sizes. For view-dependent expressions, the agent calls a rendering tool to inspect selected object IDs and resolve ambiguity. Finally, it returns a structured answer containing the predicted object ID and a textual justification.
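
The control flow of on-demand rendering and the structured answer can be illustrated with a small sketch. In the actual system the vision-language model decides when to call the rendering tool; the keyword check in `needs_rendering`, the `VISUAL_CUES` set, the `ground` function, and the JSON field names below are hypothetical stand-ins chosen only to make the example runnable.

```python
import json
from typing import Callable, List

# Hypothetical cue words standing in for the agent's own judgment about
# whether a query needs visual evidence (color, material, viewpoint).
VISUAL_CUES = {"red", "blue", "white", "wooden", "metal", "facing"}


def needs_rendering(query: str) -> bool:
    """Crude stand-in: True if the query mentions any visual cue word."""
    words = {w.strip(".,").lower() for w in query.split()}
    return bool(words & VISUAL_CUES)


def ground(query: str, candidates: List[int],
           render: Callable[[List[int]], int]) -> str:
    """Return a JSON answer; `render` is called only when it is needed.

    Assumes `candidates` is non-empty and already ranked by geometric score.
    """
    if len(candidates) == 1 or not needs_rendering(query):
        chosen, reason = candidates[0], "resolved from scene metadata"
    else:
        chosen, reason = render(candidates), "resolved after rendering candidates"
    return json.dumps({"object_id": chosen, "justification": reason})
```

Passing the renderer in as a callable keeps the expensive step optional: queries that metadata alone can resolve never trigger it.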

Results

AgentGrounder consistently outperforms SeeGround across both benchmarks in our zero-shot setting.

ScanRefer (validation)

Method	Venue	Zero-Shot	Agent	Unique Acc@0.25	Unique Acc@0.5	Multiple Acc@0.25	Multiple Acc@0.5	Overall Acc@0.25	Overall Acc@0.5
Fully supervised
G3-LQ	CVPR'24	✗	—	88.6	73.3	50.2	39.7	56.0	44.7
MCLN	ECCV'24	✗	—	86.9	72.7	52.0	40.8	57.2	45.7
ConcreteNet	ECCV'24	✗	—	86.4	82.1	42.4	38.4	50.6	46.5
Chat-Scene	NeurIPS'24	✗	Vicuna-7B	89.6	82.5	47.8	42.9	55.5	50.2
Video-3D LLM	CVPR'25	✗	LLaVA-Video 7B	88.0	78.3	50.9	45.3	58.1	51.7
GPT4Scene	ICLR'26	✗	Qwen2-VL-7B	90.3	83.7	56.4	50.9	62.6	57.0
Zero-shot
LLM-G	ICRA'24	✓	GPT-3.5	—	—	—	—	14.3	4.7
LLM-G	ICRA'24	✓	GPT-4 turbo	—	—	—	—	17.1	5.3
ZSVG3D	CVPR'24	✓	GPT-4 turbo	63.8	58.4	27.7	24.6	36.4	32.7
SeeGround	CVPR'25	✓	Qwen2-VL-72B	75.7	68.9	34.0	30.0	44.1	39.4
CSVG	BMVC'25	✓	Mistral-Large-240B	68.8	61.1	38.4	27.3	49.6	39.8
AgentGrounder (Ours)	—	✓	Qwen3-VL-32B	80.3	73.7	36.6	31.7	47.2	41.9
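
For readers unfamiliar with the metric, Acc@0.25 and Acc@0.5 count a prediction as correct when the 3D IoU between the predicted and ground-truth boxes reaches the threshold. The sketch below is a minimal version assuming axis-aligned boxes given as (center, size); the function names `iou_3d` and `acc_at` are illustrative and not from any benchmark toolkit.

```python
from typing import Sequence, Tuple

Box = Tuple[Sequence[float], Sequence[float]]  # (center xyz, size dx dy dz)


def iou_3d(a: Box, b: Box) -> float:
    """IoU of two axis-aligned 3D boxes given as (center, size)."""
    inter = 1.0
    for axis in range(3):
        a_lo, a_hi = a[0][axis] - a[1][axis] / 2, a[0][axis] + a[1][axis] / 2
        b_lo, b_hi = b[0][axis] - b[1][axis] / 2, b[0][axis] + b[1][axis] / 2
        overlap = min(a_hi, b_hi) - max(a_lo, b_lo)
        if overlap <= 0:
            return 0.0  # boxes are disjoint along this axis
        inter *= overlap
    vol_a = a[1][0] * a[1][1] * a[1][2]
    vol_b = b[1][0] * b[1][1] * b[1][2]
    return inter / (vol_a + vol_b - inter)


def acc_at(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Percentage of predictions whose IoU with ground truth >= threshold."""
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)
```

A box shifted by half its width along one axis has an IoU of 1/3 with the original, so it counts as correct at the 0.25 threshold but not at 0.5, which is why Acc@0.5 is the stricter column.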

Nr3D (validation)

Method	Easy	Hard	View Dep.	View Indep.	Overall
Fully supervised
MiKASA	69.7	59.4	65.4	64.0	64.4
ViL3DRel	70.2	57.4	62.0	64.5	64.4
SceneVerse	72.5	57.8	56.9	67.9	64.9
Zero-shot
ZSVG3D	46.5	31.7	36.8	40.0	39.0
VLM-Grounder	55.2	39.5	45.8	49.4	48.0
SeeGround	54.5	38.3	42.3	48.2	46.1
AgentGrounder (Ours)	59.6	45.4	47.9	54.5	52.4
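
The per-split gains over SeeGround follow directly from the two rows above. The short check below recomputes them; the numbers are copied from the table, while the variable names are only illustrative.

```python
# Nr3D rows copied from the table above (accuracy in percent).
splits = ["Easy", "Hard", "View Dep.", "View Indep.", "Overall"]
seeground = [54.5, 38.3, 42.3, 48.2, 46.1]
agentgrounder = [59.6, 45.4, 47.9, 54.5, 52.4]

# Absolute gain in percentage points for each split.
gains = {
    name: round(ours - base, 1)
    for name, ours, base in zip(splits, agentgrounder, seeground)
}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

The largest gain is on the Hard split (+7.1 points), and the Overall and View Indep. gains are both +6.3 points.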

Ablation Study

Ablation of agent tools on a held-out Nr3D subset (N = 120), measured by Acc@0.5.

#	Retrieval	Distance	Planning	Rendering	Easy	Hard	View Dep.	View Indep.	Overall
1	✓	✗	✗	✗	51.7	29.0	35.7	42.3	40.0
2	✓	✓	✗	✗	50.0	40.3	35.7	50.0	45.0
3	✓	✓	✓	✗	58.6	35.5	38.1	51.3	46.7
4	✓	✓	✗	✓	62.1	33.9	50.0	46.2	47.5
5	✓	✓	✓	✓	65.5	35.5	47.6	51.3	50.0

**Qualitative comparison.** Example predictions from SeeGround (left) and **AgentGrounder** (right) on ScanRefer and Nr3D scenes. Our method produces tighter bounding boxes and fewer false positives in cluttered scenes.

BibTeX

@article{huynh2026agentgrounder,
  title={AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding using Multimodal Language Models},
  author={Huynh, Cuong and Popov, Maxim and Gridusov, Denis and Kolyubin, Sergey},
  journal={arXiv preprint arXiv:2605.25901},
  year={2026}
}