Open Semantic Mapping (OSM) is a key technology in robotic perception, combining semantic segmentation and SLAM techniques. This paper introduces OSMa-Bench (Open Semantic Mapping Benchmark), a dynamically configurable and highly automated LLM/LVLM-powered pipeline for evaluating OSM solutions. The study focuses on evaluating state-of-the-art semantic mapping algorithms under varying indoor lighting conditions, a critical challenge in indoor environments. We introduce a novel dataset with simulated RGB-D sequences and ground-truth 3D reconstructions, facilitating rigorous analysis of mapping performance across different lighting conditions. Through experiments on leading models such as ConceptGraphs, BBQ and OpenScene, we evaluate the semantic fidelity of object recognition and segmentation. Additionally, we introduce a Scene Graph evaluation method to analyze the ability of models to interpret semantic structure. The results provide insights into the robustness of these models, informing future research directions for developing resilient and adaptable robotic systems.
We expanded the semantic annotation of the ReplicaCAD dataset, which made it possible to test both classes describing structural parts of the apartment (e.g., wall, floor, stairs) and classes describing furniture and household items.

We prepared sets of 22 and 8 scenes for the ReplicaCAD and HM3D datasets, respectively, which differ in their test-condition configurations (see the configuration sketch after the list):
- Baseline configuration
- Nominal Lights configuration
- Camera Light configuration
- Dynamic Lights configuration
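For illustration, the test conditions above can be written down as a small declarative configuration. The sketch below is only a possible layout: the assignment of lighting configurations to datasets, the per-dataset scene counts, and all identifiers (`TEST_CONDITIONS`, `scenes_for`) are assumptions for this sketch, not the released benchmark format.

```python
# Illustrative only: one possible declarative layout for the benchmark's
# test conditions. The dataset-to-configuration mapping shown here is an
# assumption, not the actual OSMa-Bench file format.
TEST_CONDITIONS = {
    "ReplicaCAD": {
        "num_scenes": 22,
        "lighting_configurations": [
            "baseline",
            "nominal_lights",
            "camera_light",
            "dynamic_lights",
        ],
    },
    "HM3D": {
        "num_scenes": 8,
        "lighting_configurations": [
            "nominal_lights",
            "camera_light",
        ],
    },
}


def scenes_for(dataset: str) -> int:
    """Return the number of test scenes prepared for a dataset."""
    return TEST_CONDITIONS[dataset]["num_scenes"]
```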
We sample frames along the camera trajectory within a previously generated test scene. For each key frame, we employ an LVLM to generate scene descriptions and then construct a set of questions, each targeting a specific aspect of scene understanding (a sketch of this step follows the category list below):
- Binary General (e.g., "Is there a blue sofa?")
- Binary Existence-Based (e.g., "Is there a piano?")
- Binary Logical (e.g., "Is there a chair AND a table?")
- Measurement (e.g., "How many windows are present?")
- Object Attributes (e.g., "What color is the door?")
- Object Relations - Functional (e.g., "Which object supports the table?")
- Object Relations - Spatial (e.g., "What is in front of the staircase?")
- Comparison (e.g., "Which is taller: the bookshelf or the lamp?")
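A minimal sketch of this question-generation step is shown below. It assumes a generic chat-style LVLM client passed in as `query_lvlm` (a hypothetical callable, not a real API), a fixed keyframe stride, and a simple prompt format; the actual prompts and sampling strategy used in the pipeline may differ.

```python
import json
from typing import Dict, List

QUESTION_CATEGORIES = [
    "Binary General", "Binary Existence-Based", "Binary Logical",
    "Measurement", "Object Attributes", "Object Relations - Functional",
    "Object Relations - Spatial", "Comparison",
]


def sample_keyframes(frames: List[str], stride: int = 30) -> List[str]:
    """Pick every `stride`-th frame along the trajectory as a key frame."""
    return frames[::stride]


def generate_questions(keyframe_path: str, query_lvlm) -> List[Dict]:
    """Describe a key frame with an LVLM, then turn the description into
    category-tagged question/answer pairs (illustrative prompts only)."""
    description = query_lvlm(
        image=keyframe_path,
        prompt="Describe all visible objects, their attributes and spatial relations.",
    )
    qa_prompt = (
        "From the scene description below, write one question per category "
        f"({', '.join(QUESTION_CATEGORIES)}) together with its answer, as a JSON list.\n\n"
        f"Scene description:\n{description}"
    )
    # The sketch assumes the model returns valid JSON for simplicity.
    return json.loads(query_lvlm(prompt=qa_prompt))
```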
The evaluation was conducted for three methods: ConceptGraphs, OpenScene, and the recently introduced BBQ. These methods employ different approaches, which allows us to benchmark across a wider spectrum of OSM designs.
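To make the comparison concrete, the sketch below shows one possible way to organize the evaluation loop over methods and question sets. The `answer(question)` interface and exact-match scoring are assumptions for illustration and simplify how the evaluated systems are actually queried against their reconstructed semantic maps.

```python
from collections import defaultdict


def evaluate(methods: dict, qa_pairs: list) -> dict:
    """Per-method, per-category accuracy over the generated QA pairs.

    `methods` maps a name (e.g. "ConceptGraphs") to an object exposing an
    `answer(question) -> str` method; `qa_pairs` holds dicts with
    "category", "question" and "answer" keys. Both interfaces are
    assumptions made for this sketch.
    """
    scores = {name: defaultdict(list) for name in methods}
    for name, method in methods.items():
        for qa in qa_pairs:
            predicted = method.answer(qa["question"])
            correct = predicted.strip().lower() == qa["answer"].strip().lower()
            scores[name][qa["category"]].append(correct)
    # Average the boolean outcomes into a per-category accuracy.
    return {
        name: {cat: sum(v) / len(v) for cat, v in per_cat.items()}
        for name, per_cat in scores.items()
    }
```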
The average distribution of question categories is as follows: 18.6% Binary General, 16.6% Binary Existence-Based, 18.4% Binary Logical, 5.2% Measurement, 17.0% Object Attributes, 0.8% Object Relations - Functional, 18.7% Object Relations - Spatial, and 4.7% Comparison. Functional relationships were challenging for the LLM to interpret correctly, often leading to inconsistent or ambiguous answers. As a result, many of these questions were removed during the validation process, leaving only a small proportion in the final datasets. Given their low number, including them would make comparisons with the other categories unreliable, so we exclude this category from the evaluation.
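The filtering rule described above amounts to dropping any category whose share of validated questions falls below a small threshold. The sketch below illustrates this; the 1% threshold and the data layout are assumptions chosen for illustration (the 0.8% share of Functional relations would fall below it), not the exact rule used in the pipeline.

```python
from collections import Counter


def category_distribution(qa_pairs: list) -> dict:
    """Fraction of validated questions belonging to each category."""
    counts = Counter(qa["category"] for qa in qa_pairs)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}


def drop_rare_categories(qa_pairs: list, min_share: float = 0.01) -> list:
    """Remove questions whose category is too rare for a reliable comparison."""
    shares = category_distribution(qa_pairs)
    return [qa for qa in qa_pairs if shares[qa["category"]] >= min_share]
```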