Achieving unified 3D perception and reasoning across tasks such as segmentation, retrieval, and relation understanding remains challenging, as existing methods are either object-centric or rely on costly training for inter-object reasoning.
We present a novel framework that constructs a hierarchical language-distilled Gaussian scene and its 3D semantic scene graph without scene-specific training. A Gaussian pruning mechanism refines scene geometry, while a robust multi-view language alignment strategy aggregates noisy 2D features into accurate 3D object embeddings. On top of this hierarchy, we build an open-vocabulary 3D scene graph with Vision-Language-Model-derived annotations and Graph Neural Network-based relational reasoning.
Our approach enables efficient and scalable open-vocabulary 3D reasoning by jointly modeling hierarchical semantics and inter- and intra-object relationships, validated across tasks including open-vocabulary segmentation, scene graph generation, and relation-guided retrieval.
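As a minimal illustration of the pruning stage mentioned above, the sketch below keeps only Gaussians that are sufficiently opaque and not degenerately large. The attributes and thresholds are illustrative assumptions; the paper's exact pruning criterion is not reproduced here.

import torch

def prune_gaussians(opacity: torch.Tensor,
                    scale: torch.Tensor,
                    opacity_thresh: float = 0.05,
                    scale_thresh: float = 0.5) -> torch.Tensor:
    # opacity: (N,) sigmoid-activated opacities
    # scale:   (N, 3) per-axis extents of each Gaussian
    # Illustrative rule: drop near-transparent or oversized primitives.
    # ReLaGS's actual criterion may combine other cues (e.g. rendering
    # contribution), so treat this as a sketch, not the method.
    keep = (opacity > opacity_thresh) & (scale.max(dim=-1).values < scale_thresh)
    return keep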
ReLaGS Overview. Given a reconstructed Gaussian scene, redundant primitives are first pruned to improve geometric accuracy. Heuristic clustering under multi-level SAM supervision then forms a hierarchical scene structure, where each cluster is assigned a CLIP-based language feature with outlier rejection. Finally, open-vocabulary inter- and intra-object scene graphs are obtained either by lifting LLM-derived relations for semantic diversity or by using a pretrained graph network for efficient offline inference.
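A minimal sketch of the language-alignment step, assuming per-view CLIP features for one cluster have already been extracted: views that disagree with a consensus direction are rejected before averaging. The function name and threshold are hypothetical, and the paper's exact outlier-rejection rule may differ.

import torch
import torch.nn.functional as F

def aggregate_cluster_feature(view_feats: torch.Tensor,
                              sim_thresh: float = 0.8) -> torch.Tensor:
    # view_feats: (V, D) CLIP features of one cluster seen from V views.
    # Compute a consensus direction, drop views far from it, re-average.
    feats = F.normalize(view_feats, dim=-1)
    consensus = F.normalize(feats.mean(dim=0), dim=-1)
    sims = feats @ consensus                 # (V,) cosine to consensus
    inliers = sims > sim_thresh
    if not inliers.any():                    # degenerate case: keep the best view
        inliers = sims == sims.max()
    return F.normalize(feats[inliers].mean(dim=0), dim=-1)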
We evaluate open-vocabulary retrieval on fine-grained and part-level queries. For multi-instance queries, OpenGaussian retrieves only a subset of instances, while THGS and ReLaGS both recover the full set. The key distinction emerges on the part-level queries "pirate hat" and "kamaboko", where all baselines fail entirely and only ReLaGS succeeds.
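As background on how such multi-instance retrieval can work, the sketch below ranks cluster embeddings against the CLIP text embedding of a query and keeps every cluster within a relative margin of the best score, rather than the top hit alone. Embeddings are assumed precomputed, and the threshold is illustrative rather than the paper's value.

import torch
import torch.nn.functional as F

def retrieve_instances(query_feat: torch.Tensor,
                       cluster_feats: torch.Tensor,
                       rel_thresh: float = 0.95) -> torch.Tensor:
    # query_feat:    (D,)   CLIP text embedding of the query
    # cluster_feats: (K, D) language embeddings of scene clusters
    # Keeping all clusters near the best score (instead of argmax alone)
    # lets multi-instance queries return the full set of matches.
    q = F.normalize(query_feat, dim=-1)
    c = F.normalize(cluster_feats, dim=-1)
    scores = c @ q                           # (K,) cosine similarities
    return torch.nonzero(scores >= rel_thresh * scores.max()).squeeze(-1)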
Original scene
We evaluate relational queries that require understanding spatial relationships between objects. In the disambiguation scene, both queries differ only in their relational context: ReLaGS correctly retrieves a different towel for each query, while THGS ignores the relation and returns the same object both times. In the general scene, THGS either conflates objects with their surroundings or returns nothing meaningful, while RelationField's voxel-based activations highlight only a partial region of the correct object.
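To make the relational setting concrete, answering such a query amounts to filtering scene-graph edges: keep target objects that stand in the stated relation to the anchor. The triple-based graph format below is a hypothetical simplification of the ReLaGS graph, chosen for illustration only.

# A scene graph as (subject_id, relation, object_id) triples with
# per-cluster open-vocabulary labels -- a hypothetical simplification.
edges = [(3, "hanging on", 7), (5, "lying on", 9)]
labels = {3: "towel", 5: "towel", 7: "rack", 9: "washing machine"}

def relational_query(target: str, relation: str, anchor: str) -> list:
    # Return ids of `target` objects standing in `relation` to `anchor`.
    # The relation disambiguates between the two towels:
    #   relational_query("towel", "hanging on", "rack")          -> [3]
    #   relational_query("towel", "lying on", "washing machine") -> [5]
    return [s for s, r, o in edges
            if labels.get(s) == target and r == relation
            and labels.get(o) == anchor]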
Original scene (disambiguation): two towels are present, and each query should retrieve a different one.
Original scene (general).
We compare resource usage against RelationField, the closest method to ours in capability. ReLaGS injects a structured semantic representation into a Gaussian scene with less than 25% memory overhead, while being substantially faster to train and leaner on disk.
* Values from Tab. 7. Resource usage is decomposed across the three stages of ReLaGS. Ablation studies on GNN design, pruning quality, and scene graph prediction are in the appendix.
This work has been partially funded by the EU projects dAIEDGE (GA Nr 101120726) and LUMINOUS (GA Nr 101135724).
@inproceedings{xiearafa2026relags,
title = {ReLaGS: Relational Language Gaussian Splatting},
author = {Xie, Yaxu and Arafa, Abdalla and Javanmardi, Alireza and
Millerdurai, Christen and Hu, Jia Cheng and Wang, Shaoxiang and
Pagani, Alain and Stricker, Didier},
booktitle = {CVPR},
year = {2026}
}