OSMa-Bench:
Evaluating Open Semantic Mapping Under Varying Lighting Conditions

*Indicates Corresponding Author
BE2R Lab, ITMO University
OSMa-Bench Pipeline

Our work evaluates open semantic mapping quality, providing an automated LLM/LVLM-based alternative to human assessment. We generate test sequences with different lighting conditions in a simulated indoor environment. An extra modifier, such as a variation in the robot's nominal velocity, is applied as well.

Abstract

Open Semantic Mapping (OSM) is a key technology in robotic perception, combining semantic segmentation and SLAM techniques. This paper introduces a dynamically configurable and highly automated LLM/LVLM-powered pipeline for evaluating OSM solutions called OSMa-Bench (Open Semantic Mapping Benchmark). The study focuses on evaluating state-of-the-art semantic mapping algorithms under varying indoor lighting conditions, a critical challenge in indoor environments. We introduce a novel dataset with simulated RGB-D sequences and ground truth 3D reconstructions, facilitating the rigorous analysis of mapping performance across different lighting conditions. Through experiments on leading models such as ConceptGraphs, BBQ, and OpenScene, we evaluate the semantic fidelity of object recognition and segmentation. Additionally, we introduce a Scene Graph evaluation method to analyze the ability of models to interpret semantic structure. The results provide insights into the robustness of these models, informing future research directions for developing resilient and adaptable robotic systems.

Data Preparation

We assigned sets of 22 and 8 scenes to the ReplicaCAD and HM3D datasets, respectively, each distinct in its test-condition configuration:

  • "Baseline": Uses static, non-uniformly distributed light sources available as a default scenario for the ReplicaCAD dataset only;
  • "Dynamic lighting": Corresponds to changing light conditions along the robot's path (ReplicaCAD only);
  • "Nominal lights": Relies on the mesh itself emitting light without any added light sources;
  • "Camera light": Introduces an extra directed light source attached to the camera.
In addition, for both datasets we applied a "Velocity" modification, recording sequences at doubled velocity (relative to "Baseline" for ReplicaCAD and "Nominal lights" for HM3D). A minimal sketch of how such lighting configurations could be set up in the simulator is shown below.
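The sketch below illustrates how the lighting conditions could be registered in Habitat-Sim, which provides both ReplicaCAD and HM3D scenes. The setup keys, light positions, and intensities are illustrative assumptions, not the exact values used in OSMa-Bench.

```python
# Sketch: registering named light setups for the test conditions in Habitat-Sim.
import habitat_sim
from habitat_sim.gfx import LightInfo, LightPositionModel

def make_light_setup(sim: habitat_sim.Simulator, condition: str) -> str:
    """Register a named light setup for the given test condition and return its key."""
    if condition == "nominal_lights":
        # No added light sources: rely on the emissive mesh only.
        sim.set_light_setup([], "nominal_lights")
    elif condition == "camera_light":
        # A single directed light attached to the camera, following the agent.
        sim.set_light_setup(
            [LightInfo(vector=[0.0, 0.0, -1.0, 0.0],   # direction in the camera frame
                       color=[2.0, 2.0, 2.0],
                       model=LightPositionModel.Camera)],
            "camera_light",
        )
    elif condition == "baseline":
        # Static, non-uniformly placed point light in world coordinates (w = 1 -> position).
        sim.set_light_setup(
            [LightInfo(vector=[1.5, 2.0, 0.5, 1.0],
                       color=[1.5, 1.5, 1.5],
                       model=LightPositionModel.Global)],
            "baseline",
        )
    return condition

# "Dynamic lighting" can be approximated by re-registering the setup with new
# parameters at selected waypoints along the trajectory, e.g.:
# sim.set_light_setup(updated_lights, "dynamic_lights")
```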

We expanded the semantic annotation of the ReplicaCAD dataset. This made it possible to consider, during testing, both classes describing parts of the apartment (e.g., wall, floor, stairs) and classes describing furniture and household items.


We applied four lighting configurations to the ReplicaCAD dataset: "Baseline", "Nominal Lights", "Camera Light", and "Dynamic Lights".


We applied two lighting configurations to the Habitat-Matterport 3D dataset: "Nominal Lights" and "Camera Light". For each HM3D scene, we generate two different test scenes (trajectories).
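A minimal sketch of how such test trajectories could be sampled on the scene navmesh with Habitat-Sim is given below; the waypoint sampling strategy and the way the "Velocity" modification is emulated are illustrative assumptions.

```python
# Sketch: sampling a test trajectory on an HM3D scene via the Habitat-Sim navmesh.
import habitat_sim
import numpy as np

def sample_trajectory(sim: habitat_sim.Simulator, velocity_scale: float = 1.0):
    """Sample a start/goal pair on the navmesh and return the shortest-path waypoints."""
    start = sim.pathfinder.get_random_navigable_point()
    goal = sim.pathfinder.get_random_navigable_point()

    path = habitat_sim.ShortestPath()
    path.requested_start = start
    path.requested_end = goal
    if not sim.pathfinder.find_path(path):
        return None

    waypoints = np.array(path.points)
    # Doubling the nominal velocity ("Velocity" modification) can be emulated by
    # sub-sampling waypoints twice as sparsely, so fewer frames cover the same path.
    step = max(1, int(round(velocity_scale)))
    return waypoints[::step]
```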

Visual Question Answering

Visual Question Answering Pipeline

We utilize an LLM and an LVLM to obtain scene frame descriptions and construct a set of questions with corresponding ground-truth answers. These are additionally validated and balanced to avoid ambiguous questions in the subsequent scene graph evaluation.

We sample key frames along the robot's movement within a previously generated test scene. For each key frame, we employ an LVLM to generate scene descriptions and then construct a set of questions, each targeting a specific aspect of scene understanding (a sketch of the resulting question schema follows the list):

  1. Binary General – Yes/No questions about the presence of objects and general scene characteristics (e.g., Is there a blue sofa?).
  2. Binary Existence-Based – Yes/No questions designed to track false positives by querying non-existent objects (e.g., Is there a piano?).
  3. Binary Logical – Yes/No questions with logical operators such as AND/OR (e.g., Is there a chair AND a table?).
  4. Measurement – Questions requiring numerical answers related to object counts or scene attributes (e.g., How many windows are present?).
  5. Object Attributes – Queries about object properties, including color, shape, and material (e.g., What color is the door?).
  6. Object Relations - Functional – Questions about functional relationships between objects (e.g., Which object supports the table?).
  7. Object Relations - Spatial – Queries about spatial placement of objects within the scene (e.g., What is in front of the staircase?).
  8. Comparison – Questions that compare object properties such as size, color, and position (e.g., Which is taller: the bookshelf or the lamp?).
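The sketch below illustrates one way the generated questions could be represented for validation and balancing; the field names and the toy validation rule are assumptions for illustration, not the exact pipeline schema.

```python
# Sketch: a question schema mirroring the eight categories listed above.
from dataclasses import dataclass
from enum import Enum

class Category(Enum):
    BINARY_GENERAL = "binary_general"
    BINARY_EXISTENCE = "binary_existence_based"
    BINARY_LOGICAL = "binary_logical"
    MEASUREMENT = "measurement"
    OBJECT_ATTRIBUTES = "object_attributes"
    RELATIONS_FUNCTIONAL = "object_relations_functional"
    RELATIONS_SPATIAL = "object_relations_spatial"
    COMPARISON = "comparison"

@dataclass
class VQAQuestion:
    frame_id: int        # key frame the question was generated from
    category: Category
    question: str        # e.g. "Is there a blue sofa?"
    ground_truth: str    # e.g. "yes", "3", "red"

def is_valid(q: VQAQuestion) -> bool:
    """Toy validation rule: binary questions must have unambiguous yes/no answers."""
    if q.category.value.startswith("binary"):
        return q.ground_truth.lower() in {"yes", "no"}
    return bool(q.ground_truth.strip())
```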

Experiments

Semantic Segmentation

We assigned sets of 22 and 8 trajectories to the ReplicaCAD and HM3D datasets, respectively, each distinct in its test-condition configuration.

The evaluation was conducted for three methods: ConceptGraphs, OpenScene, and the recently proposed BBQ. These methods employ different approaches, which lets us benchmark across a wider spectrum of OSM solutions.
The obtained segmentation quality metrics for all considered methods and scenes are presented in Tables I-III, while their relative change (degradation) under the different test conditions is illustrated for the ReplicaCAD dataset (lower values indicate lower robustness).
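The relative change can be computed as a simple ratio to a reference condition, as in the sketch below; the metric names, the choice of "baseline" as reference, and the numbers in the example are hypothetical.

```python
# Sketch: relative-change (degradation) of segmentation metrics across test conditions.
from typing import Dict

def relative_degradation(metrics: Dict[str, Dict[str, float]],
                         reference: str = "baseline") -> Dict[str, Dict[str, float]]:
    """Return each metric as a ratio to the reference condition
    (1.0 = no degradation; lower values = less robust)."""
    ref = metrics[reference]
    return {
        condition: {name: value / ref[name] for name, value in scores.items()}
        for condition, scores in metrics.items()
        if condition != reference
    }

# Example with hypothetical numbers for one method on ReplicaCAD.
metrics = {
    "baseline":       {"mIoU": 0.42, "F-score": 0.55},
    "camera_light":   {"mIoU": 0.35, "F-score": 0.48},
    "dynamic_lights": {"mIoU": 0.30, "F-score": 0.41},
}
print(relative_degradation(metrics))
```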

Semantic Graph Evaluation

The VQA pipeline generated, on average, 184 and 76 questions per scene, yielding 4055 and 611 questions in total for ReplicaCAD and HM3D, respectively.

The average distribution across question categories is as follows: 18.6% Binary General, 16.6% Binary Existence-Based, 18.4% Binary Logical, 5.2% Measurement, 17.0% Object Attributes, 0.8% Object Relations - Functional, 18.7% Object Relations - Spatial, and 4.7% Comparison. Functional relationships were challenging for the LLM to interpret correctly, often leading to inconsistent or ambiguous answers. As a result, many of these questions were removed during the validation process, leaving only a small proportion in the final datasets. Given their low number, including them would make comparisons with other categories unreliable, so we exclude them from the evaluation.
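The sketch below shows one way to compute the per-category ratios and drop under-represented categories (such as functional relations) before evaluation; the threshold value is an illustrative assumption.

```python
# Sketch: per-category question ratios and exclusion of rare categories.
from collections import Counter
from typing import Dict, List

def category_ratios(categories: List[str]) -> Dict[str, float]:
    """Fraction of questions falling into each category."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: count / total for cat, count in counts.items()}

def filter_rare_categories(questions: List[dict], min_ratio: float = 0.02) -> List[dict]:
    """Drop questions whose category makes up less than min_ratio of the dataset."""
    ratios = category_ratios([q["category"] for q in questions])
    return [q for q in questions if ratios[q["category"]] >= min_ratio]
```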

BibTeX

BibTex Code Here