QueryBridge provides over one million annotated natural language questions aligned with executable SPARQL queries, enabling robust training and evaluation of KGQA systems.
QueryBridge is a large-scale dataset designed to address the long-standing data scarcity problem in Question Answering over Knowledge Graphs (KGQA). Unlike prior benchmarks that contain only thousands of questions, QueryBridge provides 1,004,534 natural language questions paired with executable SPARQL queries over DBpedia.
- Enables the development of data-intensive neural semantic parsers and large language models (LLMs) by surmounting the 30k-example "data wall" of legacy benchmarks.
- Enriched with structural metadata and XML-style tags (`<qt>`, `<p>`, `<o>`, `<cc>`) to support both end-to-end and component-level evaluation.
- Features comprehensive coverage of multi-hop query shapes, including Chain (6.77%), Star (56.63%), Tree, Cycle, Flower, and Set-Shape (13.65%) queries.
QueryBridge is systematically generated via Maestro, the first framework to automatically produce comprehensive, utterance-aware benchmarks for any targeted Knowledge Graph.
Figure 1: Maestro architecture for automated benchmark generation. The system (1) selects representative seed entities, (2) instantiates diverse query shapes over the KG, and (3) lexicalizes graph patterns into natural language.
Static benchmarks become stale as KG ontologies evolve. Maestro addresses this by traversing the graph starting from selected seeds to find all valid subgraph shapes, ensuring benchmarks remain accurate.
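To make the traversal idea concrete, here is a minimal, self-contained sketch that enumerates Chain- and Star-shaped patterns around a seed entity in a toy in-memory graph. The triples, entity names, and helper functions are illustrative assumptions, not Maestro's actual implementation.

```python
from collections import defaultdict
from itertools import combinations

# Toy knowledge graph as (subject, predicate, object) triples.
# Entities and predicates are illustrative placeholders, not Maestro output.
triples = [
    ("dbr:Berlin", "dbo:country", "dbr:Germany"),
    ("dbr:Berlin", "dbo:leader", "dbr:Kai_Wegner"),
    ("dbr:Germany", "dbo:capital", "dbr:Berlin"),
    ("dbr:Germany", "dbo:currency", "dbr:Euro"),
]

# Index outgoing edges per subject for traversal from a seed.
out_edges = defaultdict(list)
for s, p, o in triples:
    out_edges[s].append((p, o))

def star_patterns(seed, size=2):
    """Combinations of `size` distinct outgoing edges around the seed (Star shape)."""
    return list(combinations(out_edges[seed], size))

def chain_patterns(seed, length=2):
    """Paths of `length` hops starting at the seed (Chain shape)."""
    chains = []

    def walk(node, path):
        if len(path) == length:
            chains.append(tuple(path))
            return
        for p, o in out_edges[node]:
            walk(o, path + [(node, p, o)])

    walk(seed, [])
    return chains

print(star_patterns("dbr:Berlin"))
print(chain_patterns("dbr:Berlin"))
```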
By mapping external text corpora to KG predicates, Maestro captures semantically equivalent utterances. This results in high-quality natural language questions that are on par with manually generated ones.
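The following sketch conveys the underlying idea of predicate lexicalization with a hand-written template dictionary. The predicates and paraphrase templates are invented for illustration; Maestro derives such paraphrases from external corpora rather than from a hard-coded table.

```python
# Hypothetical predicate-to-utterance templates; a real pipeline would mine
# these paraphrases from text corpora rather than hard-code them.
LEXICALIZATIONS = {
    "dbo:spouse":     ["Who is the spouse of {e}?", "Whom is {e} married to?"],
    "dbo:birthPlace": ["Where was {e} born?", "What is the birthplace of {e}?"],
}

def verbalize(predicate, entity_label):
    """Produce every known natural-language variant for a single-triple pattern."""
    return [t.format(e=entity_label) for t in LEXICALIZATIONS.get(predicate, [])]

print(verbalize("dbo:spouse", "Barack Obama"))
# ['Who is the spouse of Barack Obama?', 'Whom is Barack Obama married to?']
```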
Instead of random selection, Maestro uses Class Importance ($I_c$) and Entity Popularity ($P_e$) heuristics to ensure representative sampling of both common and tail entities.
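The exact definitions of $I_c$ and $P_e$ are given in the Maestro paper; the sketch below only conveys how the two signals could be combined into a single seed score, using class size and node degree as stand-in proxies. All entities, classes, and counts are made-up placeholders.

```python
from collections import Counter

# Toy entity metadata; classes and degree counts are placeholder values.
entities = {
    "dbr:Berlin":       {"class": "dbo:City",    "degree": 1200},
    "dbr:Germany":      {"class": "dbo:Country", "degree": 5400},
    "dbr:Kleinmachnow": {"class": "dbo:City",    "degree": 35},
}

class_sizes = Counter(meta["class"] for meta in entities.values())
max_degree = max(meta["degree"] for meta in entities.values())

def seed_score(name, alpha=0.5):
    """Weighted mix of class importance (proxied here by class size) and
    entity popularity (proxied here by normalized degree)."""
    meta = entities[name]
    importance = class_sizes[meta["class"]] / len(entities)
    popularity = meta["degree"] / max_degree
    return alpha * importance + (1 - alpha) * popularity

# Rank candidates so seeds can be drawn from both the head and the tail of the list.
print(sorted(entities, key=seed_score, reverse=True))
```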
QueryBridge is distributed through the Hugging Face datasets library, allowing researchers to seamlessly load, filter, and process the dataset without manual downloads.
```python
from datasets import load_dataset

# Load QueryBridge (≈1.0M question–SPARQL pairs)
# Cached and streamed automatically by Hugging Face
dataset = load_dataset("aorogat/QueryBridge")

# Inspect dataset structure
print(dataset)
print(dataset["train"].column_names)

# Example 1: Filter by query structure (STAR-shaped queries)
star_queries = dataset["train"].filter(
    lambda ex: ex["shapeType"] == "STAR"
)
print(f"Number of STAR queries: {len(star_queries)}")

# Example 2: Filter by reasoning complexity
complex_queries = dataset["train"].filter(
    lambda ex: ex["questionComplexity"] >= 0.7
)

# Example 3: Access token-level semantic supervision
sample = dataset["train"][0]
print("Raw question:")
print(sample["questionString"])
print("\nTagged question (NL ↔ SPARQL alignment):")
print(sample["questionStringTagged"])

# Example 4: Retrieve the executable SPARQL query
print("\nCorresponding SPARQL query:")
print(sample["query"])
```
| Field | Description |
|---|---|
| `questionString` | The raw natural language question. |
| `questionStringTagged` | XML-annotated version linking to SPARQL components. |
| `query` | The executable SPARQL query for DBpedia. |
| `shapeType` | The structural query pattern (Chain, Star, Tree, etc.). |
| `questionComplexity` | Normalized score based on tokens, triples, and keywords. |
| `answerCardinality` | Number of gold standard answers returned. |
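As a quick sanity check, the shape distribution quoted earlier can be recomputed directly from the `shapeType` column (assuming the dataset was loaded as in the snippet above):

```python
from collections import Counter

# Tally query shapes across the training split (continues the loading snippet above).
shape_counts = Counter(dataset["train"]["shapeType"])
total = sum(shape_counts.values())
for shape, count in shape_counts.most_common():
    print(f"{shape}: {count} ({100 * count / total:.2f}%)")
```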
If you use QueryBridge, Maestro, or any of their derived resources (e.g., subsets, extensions, benchmarks, or baselines) in your research, please cite the corresponding publications below. Proper citation supports reproducibility, transparency, and sustained maintenance of large-scale benchmarks.