ReLaGS: Relational Language Gaussian Splatting

CVPR 2026

* Equal contribution
1 German Research Center for Artificial Intelligence (DFKI), 2 RPTU Kaiserslautern-Landau, 3 University of Modena and Reggio Emilia

ReLaGS enables structural and relational reasoning over 3D Gaussian Splatting scenes using open-vocabulary language queries — no scene-specific training required.

Abstract

Achieving unified 3D perception and reasoning across tasks such as segmentation, retrieval, and relation understanding remains challenging, as existing methods are either object-centric or rely on costly training for inter-object reasoning.

We present a novel framework that constructs a hierarchical language-distilled Gaussian scene and its 3D semantic scene graph without scene-specific training. A Gaussian pruning mechanism refines scene geometry, while a robust multi-view language alignment strategy aggregates noisy 2D features into accurate 3D object embeddings. On top of this hierarchy, we build an open-vocabulary 3D scene graph with annotations derived from vision-language models and relational reasoning based on a graph neural network (GNN).
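To make the idea of robust multi-view aggregation concrete, here is a minimal sketch. It assumes each 3D cluster has already collected one 2D feature vector (for example a CLIP embedding) per view in which it is visible, and it rejects views with a median-absolute-deviation (MAD) rule on cosine similarity to the consensus direction. The MAD rule, the helper names (`robust_object_embedding`, `k`, `min_mad`), and the defaults are illustrative assumptions, not the outlier-rejection criterion used in ReLaGS.

```python
import math
import statistics
from typing import List, Sequence

def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm (a zero vector stays zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else [0.0] * len(v)

def _mean(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def robust_object_embedding(
    view_features: Sequence[Sequence[float]],
    k: float = 2.0,         # hypothetical: how many MADs below the median a view may fall
    min_mad: float = 0.01,  # hypothetical floor so near-identical views are not over-pruned
) -> List[float]:
    """Fuse per-view 2D features of one 3D cluster into a single unit embedding."""
    if not view_features:
        raise ValueError("need at least one view")
    unit = [_normalize(f) for f in view_features]
    center = _normalize(_mean(unit))
    # Cosine similarity of each unit view feature to the unit consensus direction.
    sims = [sum(a * b for a, b in zip(f, center)) for f in unit]
    med = statistics.median(sims)
    mad = max(statistics.median([abs(s - med) for s in sims]), min_mad)
    # At least half the views lie at or above the median, so inliers is never empty.
    inliers = [f for f, s in zip(unit, sims) if s >= med - k * mad]
    return _normalize(_mean(inliers))


# Four views agree on the object; the fifth sees an occluder instead.
views = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(robust_object_embedding(views))  # last component is 0.0: the occluded view was rejected
```

With these toy vectors a plain average would leak roughly 0.24 of the occluder direction into the embedding, while the filtered average drops that view entirely and keeps the slightly noisy but consistent one.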

Our approach enables efficient and scalable open-vocabulary 3D reasoning by jointly modeling hierarchical semantics and inter- and intra-object relationships. We validate it on tasks including open-vocabulary segmentation, scene graph generation, and relation-guided retrieval.
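One way to picture a hierarchical scene graph that carries both inter- and intra-object relationships is as object nodes that own part nodes, plus a flat list of (subject, relation, object) edges. The sketch below is a hypothetical data structure; the class and method names are illustrative and not taken from the ReLaGS code. An edge counts as intra-object when both endpoints resolve to the same top-level object.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Edge = Tuple[int, str, int]  # (subject node, relation, object node)

@dataclass
class Node:
    """One node of a hypothetical two-level hierarchy (object or part)."""
    node_id: int
    label: str                    # open-vocabulary label, e.g. "figurine"
    parent: Optional[int] = None  # owning object for a part; None for an object
    gaussian_ids: List[int] = field(default_factory=list)  # Gaussians in this cluster

@dataclass
class SceneGraph:
    nodes: Dict[int, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def relate(self, subject: int, relation: str, obj: int) -> None:
        self.edges.append((subject, relation, obj))

    def root(self, node_id: int) -> int:
        """Follow parent links up to the top-level object."""
        node = self.nodes[node_id]
        while node.parent is not None:
            node = self.nodes[node.parent]
        return node.node_id

    def intra_edges(self) -> List[Edge]:
        """Edges whose endpoints lie inside the same object (e.g. part and whole)."""
        return [e for e in self.edges if self.root(e[0]) == self.root(e[2])]

    def inter_edges(self) -> List[Edge]:
        """Edges connecting two different objects."""
        return [e for e in self.edges if self.root(e[0]) != self.root(e[2])]


graph = SceneGraph()
graph.add_node(Node(0, "figurine"))
graph.add_node(Node(1, "pirate hat", parent=0))
graph.add_node(Node(2, "table"))
graph.relate(1, "worn by", 0)
graph.relate(0, "standing on", 2)
print(graph.intra_edges())  # [(1, 'worn by', 0)]
print(graph.inter_edges())  # [(0, 'standing on', 2)]
```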

Pipeline

ReLaGS Pipeline

ReLaGS Overview. Given a reconstructed Gaussian scene, redundant primitives are first pruned to improve geometric accuracy. Heuristic clustering under multi-level SAM supervision then forms a hierarchical scene structure, where each cluster is assigned a CLIP-based language feature with outlier rejection. Finally, open-vocabulary inter- and intra-object scene graphs are obtained either by lifting LLM-derived relations for semantic diversity or by using a pretrained graph network for efficient offline inference.
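The caption does not spell out the pruning criterion. As one illustration, a common heuristic in the Gaussian Splatting literature scores each primitive by the largest alpha-blending weight it contributes to any training ray and drops primitives that never matter. The sketch below assumes per-ray lists of (Gaussian id, alpha) pairs sorted front to back are available; `blending_weights`, `prune_gaussians`, and the `threshold` default are hypothetical, and ReLaGS's actual pruning rule may differ.

```python
from typing import List, Sequence, Tuple

def blending_weights(alphas: Sequence[float]) -> List[float]:
    """Front-to-back compositing weights w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights: List[float] = []
    transmittance = 1.0
    for alpha in alphas:
        weights.append(alpha * transmittance)
        transmittance *= 1.0 - alpha
    return weights

def prune_gaussians(
    rays: Sequence[Sequence[Tuple[int, float]]],
    num_gaussians: int,
    threshold: float = 0.01,  # hypothetical cut-off on the max blending weight
) -> List[int]:
    """Return ids of Gaussians to keep: max blending weight over all rays >= threshold."""
    importance = [0.0] * num_gaussians
    for ray in rays:  # each ray: (gaussian_id, alpha) pairs sorted front to back
        weights = blending_weights([alpha for _, alpha in ray])
        for (gid, _), w in zip(ray, weights):
            importance[gid] = max(importance[gid], w)
    return [gid for gid, score in enumerate(importance) if score >= threshold]


rays = [[(0, 0.9), (1, 0.5), (2, 0.3)], [(3, 0.005)]]
print(prune_gaussians(rays, num_gaussians=5))  # [0, 1, 2]
```

Gaussian 3 is nearly transparent and Gaussian 4 is never hit by any ray, so both are removed; Gaussian 2 survives despite sitting behind two others, because enough transmittance still reaches it.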

Qualitative Results

We evaluate open-vocabulary retrieval on fine-grained and part-level queries. For multi-instance queries, OpenGaussian retrieves only a subset of the instances, while THGS and ReLaGS both recover the full set. The key distinction emerges on the fine-grained and part-level queries “pirate hat” and “kamaboko”, where both baselines fail entirely and only ReLaGS succeeds.
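Recovering every instance of a multi-instance query requires thresholding rather than taking the single best match. A minimal sketch, assuming each cluster already carries a language embedding: score each cluster with a LERF-style relevancy (a pairwise softmax of the query against generic canonical phrases such as “object” or “stuff”, minimized over phrases) and return every cluster whose relevancy exceeds 0.5. This scoring rule is borrowed from LERF for illustration and is not necessarily what ReLaGS uses; the toy 3-D vectors stand in for CLIP embeddings, and all names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def relevancy(obj: Sequence[float], query: Sequence[float],
              negatives: Sequence[Sequence[float]], temperature: float = 10.0) -> float:
    """LERF-style score: min over canonical negatives of softmax(query vs. negative)."""
    s_q = _cosine(obj, query)
    # exp(t*s_q) / (exp(t*s_q) + exp(t*s_n)) == 1 / (1 + exp(t*(s_n - s_q)))
    return min(1.0 / (1.0 + math.exp(temperature * (_cosine(obj, n) - s_q)))
               for n in negatives)

def retrieve(clusters: Dict[int, Sequence[float]], query: Sequence[float],
             negatives: Sequence[Sequence[float]], threshold: float = 0.5) -> List[int]:
    """Return every cluster id whose relevancy exceeds threshold, not just the argmax."""
    return sorted(cid for cid, emb in clusters.items()
                  if relevancy(emb, query, negatives) > threshold)


# Toy 3-D vectors standing in for CLIP embeddings (hypothetical values).
clusters = {0: [0.9, 0.1, 0.1], 1: [0.8, 0.0, 0.3], 2: [0.1, 0.9, 0.2], 3: [0.1, 0.1, 0.95]}
query = [1.0, 0.0, 0.0]                         # e.g. "rubber duck"
negatives = [[0.0, 0.0, 1.0], [0.0, 0.2, 1.0]]  # e.g. "object", "stuff"
print(retrieve(clusters, query, negatives))  # [0, 1]
```

Both duck-like clusters clear the threshold, so the query returns two instances; an argmax over similarity would have returned only one.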

Scene 1: figurines

Original scene

Q1 “rubber duck” (two instances present in the scene)
  OpenGaussian: 1 of 2
  THGS: 2 of 2
  ReLaGS (ours): 2 of 2

Q2 “pirate hat” (small part-level object on a figurine)
  OpenGaussian: no result
  THGS: no result
  ReLaGS (ours): correct
Scene 2: ramen

Original scene

Q1 “egg” (two instances present in the scene)
  OpenGaussian: 1 of 2
  THGS: 2 of 2
  ReLaGS (ours): 2 of 2

Q2 “kamaboko” (fine-grained ramen topping; the baselines confuse it with nearby ingredients)
  OpenGaussian: wrong
  THGS: wrong
  ReLaGS (ours): correct

We evaluate relational queries that require understanding spatial relationships between objects. In the bathroom disambiguation scene, the two queries differ only in their relational context: ReLaGS retrieves a different towel for each query, while THGS ignores the relation and returns the same towel both times. In the office scene with general relational queries, THGS either conflates objects with their surroundings or returns nothing meaningful, while RelationField’s voxel-based activations highlight only part of the correct object.
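The towel example shows why the relation has to filter candidates rather than be ignored. A minimal sketch, assuming the query has already been split into (subject, relation, anchor) and the scene graph stores (subject, relation, object) triples: keep only subject nodes with a matching edge to an anchor node. Exact string matching stands in for the embedding similarity an open-vocabulary system would use, and the function name and toy graph are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[int, str, int]  # (subject node, relation, object node)

def resolve_relational_query(
    labels: Dict[int, str],
    edges: List[Triple],
    subject: str,
    relation: str,
    anchor: str,
) -> List[int]:
    """Return ids of `subject` nodes linked to an `anchor` node by `relation`.

    Exact string equality is a stand-in for open-vocabulary (embedding) matching.
    """
    subjects = {nid for nid, label in labels.items() if label == subject}
    anchors = {nid for nid, label in labels.items() if label == anchor}
    return sorted({s for s, rel, o in edges
                   if s in subjects and rel == relation and o in anchors})


# Toy bathroom graph (hypothetical ids and labels).
labels = {0: "towel", 1: "towel", 2: "wall", 3: "bathroom cabinet"}
edges = [(0, "hanging on", 2), (1, "hanging on", 3)]
print(resolve_relational_query(labels, edges, "towel", "hanging on", "wall"))              # [0]
print(resolve_relational_query(labels, edges, "towel", "hanging on", "bathroom cabinet"))  # [1]
```

A lookup on the subject label alone would return both towels for both queries, which mirrors the THGS behavior in the bathroom scene.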

Scene 1: bathroom (object disambiguation)

Original scene

Two towels are present; each query should retrieve a different one.

Q1 “towel hanging on wall” (expected: towel A)
  THGS: towel A ✓
  ReLaGS (ours): towel A ✓

Q2 “towel hanging on bathroom cabinet” (expected: towel B)
  THGS: same towel as Q1
  ReLaGS (ours): towel B ✓
Scene 2: office (general relational queries)

Original scene

Q1 “picture hanging on wall”
  THGS: picture + wall (conflates the picture with the wall and selects both)
  RelationField: partial region (voxel activation map)
  ReLaGS (ours): correct

Q2 “monitor standing on desk”
  THGS: wrong (no meaningful selection)
  RelationField: partial region (voxel activation map)
  ReLaGS (ours): correct

Runtime & Resource Comparison

We compare resource usage against RelationField, the method closest to ours in capability. ReLaGS adds a structured semantic representation to a Gaussian scene with less than 25% memory overhead, while training about 4.8× faster and needing about 7.7× less disk storage and 4.3× less GPU memory than RelationField.

Resource        ReLaGS     RelationField
Training time   12.6 min   60 min
Disk storage    65 MB      500 MB
GPU memory      7.5 GB     32 GB

* Values from Tab. 7 of the paper, which also breaks resource usage down across the three stages of ReLaGS. Ablation studies on GNN design, pruning quality, and scene graph prediction are in the paper's appendix.

Acknowledgements

This work has been partially funded by the EU projects dAIEDGE (GA Nr 101120726) and LUMINOUS (GA Nr 101135724).

BibTeX

@inproceedings{xiearafa2026relags,
  title     = {ReLaGS: Relational Language Gaussian Splatting},
  author    = {Xie, Yaxu and Arafa, Abdalla and Javanmardi, Alireza and
               Millerdurai, Christen and Hu, Jia Cheng and Wang, Shaoxiang and
               Pagani, Alain and Stricker, Didier},
  booktitle = {CVPR},
  year      = {2026}
}