CIKM 2025, SIGMOD 2023

Stop Training KGQA on 30K Examples

by Abdelghny Orogat

QueryBridge provides over one million annotated natural language questions aligned with executable SPARQL queries, enabling robust training and evaluation of KGQA systems.


QueryBridge: Million-Scale Semantic Mapping

QueryBridge is a large-scale dataset designed to address the long-standing data scarcity problem in Question Answering over Knowledge Graphs (KGQA). Unlike prior benchmarks that contain only thousands of questions, QueryBridge provides 1,004,534 natural language questions paired with executable SPARQL queries over DBpedia.

Million-Scale Training

Enables the development of data-intensive neural semantic parsers and large language models (LLMs) by surmounting the 30k-example "data wall" of legacy benchmarks.

Fine-Grained Supervision

Enriched with structural metadata and XML-style tags (<qt>, <p>, <o>, <cc>) to support both end-to-end and component-level evaluation.
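For illustration, a tagged question might look like the following. The sentence and segmentation here are invented; the dataset's questionStringTagged field is the authoritative source for the exact tag inventory and usage.

import re

# Hypothetical tagged question illustrating the annotation style.
# <qt> marks the question-type token, <p> a predicate mention, and
# <o> an object (entity) mention; actual dataset annotations may
# segment differently.
tagged = "<qt>What</qt> is the <p>birth place</p> of <o>Barack Obama</o>?"

# Stripping the XML-style tags recovers the raw question string.
raw = re.sub(r"</?\w+>", "", tagged)
print(raw)  # What is the birth place of Barack Obama?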

Structural Taxonomy

Features comprehensive coverage of query structures, including Chain (6.77%), Star (56.63%), Tree, Cycle, Flower, and Set-Shape (13.65%) queries, spanning single- and multi-hop patterns.

Maestro: The Generation Framework

QueryBridge is systematically generated via Maestro, the first framework to automatically produce comprehensive, utterance-aware benchmarks for any targeted Knowledge Graph.

Figure 1: Maestro architecture for automated benchmark generation. The system (1) selects representative seed entities, (2) instantiates diverse query shapes over the KG, and (3) lexicalizes graph patterns into natural language.

Resilience to KG Evolution

Static benchmarks become stale as KG ontologies evolve. Maestro addresses this by traversing the graph from selected seed entities to find all valid subgraph shapes, so a benchmark can be regenerated to reflect the current state of the KG.
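To make the idea concrete, here is a minimal sketch of seed-anchored traversal over a toy adjacency-list view of a KG, enumerating chain-shaped triple patterns up to a fixed hop count. This is illustrative only; Maestro's actual shape instantiation covers all six shape families and operates over the full graph.

from collections import deque

# Toy KG as an adjacency list: subject -> [(predicate, object), ...].
kg = {
    "dbr:Barack_Obama": [("dbo:birthPlace", "dbr:Honolulu")],
    "dbr:Honolulu": [("dbo:country", "dbr:United_States")],
}

def enumerate_chains(seed, max_hops=2):
    """Enumerate chain-shaped triple patterns rooted at a seed entity."""
    chains, queue = [], deque([(seed, [])])
    while queue:
        node, path = queue.popleft()
        if path:
            chains.append(path)
        if len(path) < max_hops:
            for pred, obj in kg.get(node, []):
                queue.append((obj, path + [(node, pred, obj)]))
    return chains

for chain in enumerate_chains("dbr:Barack_Obama"):
    print(chain)

Because the traversal runs against the live graph at generation time, re-running it after an ontology change yields shapes that reflect the updated KG.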

Utterance-Aware Lexicalization

By mapping external text corpora to KG predicates, Maestro captures semantically equivalent utterances. This results in high-quality natural language questions that are on par with manually generated ones.
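As a toy illustration, with an invented lexicon standing in for the corpus-derived mappings Maestro learns, multiple surface forms of one predicate yield paraphrases of the same underlying query:

# Invented predicate lexicon for illustration; Maestro derives such
# mappings from external text corpora rather than hand-written lists.
utterances = {
    "dbo:birthPlace": ["birth place", "place of birth", "birthplace"],
}

# Each surface form produces a distinct phrasing of the same
# triple pattern (dbr:Barack_Obama, dbo:birthPlace, ?place).
for phrase in utterances["dbo:birthPlace"]:
    print(f"What is the {phrase} of Barack Obama?")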

Bias-Free Sampling

Instead of uniform random selection, Maestro uses Class Importance ($I_c$) and Entity Popularity ($P_e$) heuristics to ensure representative sampling of both common and tail entities.
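Below is a minimal sketch of what importance-weighted seed selection could look like, assuming (purely for illustration) that class importance tracks instance counts and entity popularity tracks node degree; Maestro's actual $I_c$ and $P_e$ definitions are given in the PACMMOD 2023 paper.

import random

# (entity, class instance count as a stand-in for I_c,
#  node degree as a stand-in for P_e) -- illustrative values only.
entities = [
    ("dbr:Barack_Obama", 1_500_000, 4200),
    ("dbr:Honolulu", 800_000, 950),
    ("dbr:Obscure_Village", 800_000, 3),
]

# Weighted (non-uniform) sampling: important, popular entities are
# favored, while tail entities retain a nonzero selection probability.
weights = [i_c * p_e for _, i_c, p_e in entities]
seeds = random.choices([name for name, _, _ in entities], weights=weights, k=2)
print(seeds)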

Using QueryBridge via Hugging Face

QueryBridge is distributed through the Hugging Face datasets library, allowing researchers to seamlessly load, filter, and process the dataset without manual downloads.

Quick Load (Python)

Python · Hugging Face

from datasets import load_dataset

# Load QueryBridge (≈1.0M question–SPARQL pairs)
# Downloaded and cached automatically by the datasets library
dataset = load_dataset(
    "aorogat/QueryBridge"
)

# Inspect dataset structure
print(dataset)
print(dataset["train"].column_names)

# Example 1: Filter by query structure (STAR-shaped queries)
star_queries = dataset["train"].filter(
    lambda ex: ex["shapeType"] == "STAR"
)

print(
    f"Number of STAR queries: {len(star_queries)}"
)

# Example 2: Filter by reasoning complexity
complex_queries = dataset["train"].filter(
    lambda ex: ex["questionComplexity"] >= 0.7
)

# Example 3: Access token-level semantic supervision
sample = dataset["train"][0]

print("Raw question:")
print(sample["questionString"])

print("\nTagged question (NL ↔ SPARQL alignment):")
print(sample["questionStringTagged"])

# Example 4: Retrieve the executable SPARQL query
print("\nCorresponding SPARQL query:")
print(sample["query"])

Field-Level Schema

Field | Description
questionString | The raw natural language question.
questionStringTagged | XML-annotated version linking to SPARQL components.
query | The executable SPARQL query for DBpedia.
shapeType | The structural query pattern (Chain, Star, Tree, etc.).
questionComplexity | Normalized score based on tokens, triples, and keywords.
answerCardinality | Number of gold standard answers returned.
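Since every query field is executable, gold answers can be re-derived against a live endpoint. Here is a sketch using the SPARQLWrapper library and the public DBpedia endpoint; note that live result counts may drift from the recorded answerCardinality as DBpedia itself evolves, and that the snippet assumes a SELECT query.

from SPARQLWrapper import SPARQLWrapper, JSON

# Execute a gold SPARQL query against the public DBpedia endpoint.
# `sample` is the record loaded in the quick-load example above.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery(sample["query"])

results = sparql.query().convert()
bindings = results["results"]["bindings"]  # SELECT-query result rows

# Compare the live answer count with the recorded gold cardinality.
print(f"Live answers: {len(bindings)}")
print(f"Recorded answerCardinality: {sample['answerCardinality']}")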

How to Cite

If you use QueryBridge, Maestro, or any of their derived resources (e.g., subsets, extensions, benchmarks, or baselines) in your research, please cite the corresponding publications below. Proper citation supports reproducibility, transparency, and sustained maintenance of large-scale benchmarks.

QueryBridge (CIKM 2025) BibTeX citation
@inproceedings{10.1145/3746252.3761623,
author = {Orogat, Abdelghny and El-Roby, Ahmed},
title = {QueryBridge: One Million Annotated Questions with SPARQL Queries - Dataset for Question Answering over Knowledge Graphs},
year = {2025},
isbn = {9798400720406},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3746252.3761623},
doi = {10.1145/3746252.3761623},
booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
pages = {6503–6507},
numpages = {5},
keywords = {dataset, knowledge graphs, question answering, sparql},
location = {Seoul, Republic of Korea},
series = {CIKM '25}
}
Use when: Dataset usage, training/evaluation baselines, derived subsets, or extensions of QueryBridge.
Maestro (PACMMOD 2023) BibTeX citation
@article{10.1145/3589322,
author = {Orogat, Abdelghny and El-Roby, Ahmed},
title = {Maestro: Automatic Generation of Comprehensive Benchmarks for Question Answering Over Knowledge Graphs},
year = {2023},
issue_date = {June 2023},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {1},
number = {2},
url = {https://doi.org/10.1145/3589322},
doi = {10.1145/3589322},
journal = {Proc. ACM Manag. Data},
month = jun,
articleno = {177},
numpages = {24},
keywords = {automatic generation, benchmarks, comprehensiveness evaluation, question answering over knowledge graphs}
}
Use when: Referencing the benchmark-generation methodology, regeneration, or applying Maestro to other knowledge graphs.
Note: Citations are provided in BibTeX for convenience. For paper submissions, keep the BibTeX entries unchanged to preserve DOI metadata.