Constructing Knowledge Graphs from Heterogeneous Data Sources: Use Case in Ericsson

A modular, six-layer framework for managing interlinked CloudRAN data using Knowledge Graphs, enabling efficient analytics and decision-making in telecommunications.

Abdelghny Orogat, Sri Lakshmi Vadlamani, Dimple Thomas, Ahmed El-Roby

Carleton University, Ottawa · Telefonaktiebolaget LM Ericsson, Stockholm

Publication

Conference CIKM '24
Location Boise, ID, USA
Publisher ACM

The Data Integration Challenge

In Cloud Radio Access Networks (CloudRAN), data originates from numerous heterogeneous sources: relational databases storing test configurations, JSON files containing execution results, text logs documenting system behavior, and metrics tracking operational performance. This diversity creates a fundamental challenge for telecommunications companies like Ericsson.

When engineers need to diagnose a test failure, they must manually traverse multiple disconnected systems. A test case resides in one database, its execution results in another, software version details in a third system, and deployment logs in yet another location. This fragmentation transforms routine investigations into hours-long data archaeology expeditions.

The Ericsogate Solution

Ericsogate addresses this problem by constructing a unified Knowledge Graph—a semantic network where entities from disparate sources are automatically identified, linked, and made queryable through their relationships.

What previously required manual correlation across five systems now resolves through a single traversal: click a failed test, immediately access its linked software version, the engineer who executed it, the hardware configuration, and related historical failures.

Knowledge Graph Fundamentals

A Knowledge Graph represents information as a network of entities (nodes) and their relationships (edges). Unlike traditional databases that store isolated records, Knowledge Graphs explicitly encode semantic connections.

Construction from Heterogeneous Sources

Consider three independent data sources at Ericsson:

  • Relational Database: Cell tower metadata (cell_id: 123, bandwidth: 20MHz, location: Ottawa)
  • JSON Configuration: Technical specifications (cell_id: 123, technology: 5G)
  • Text Documents: Geographic information ("Ottawa, population 1 million...")

Ericsogate identifies that cell_id 123 and Ottawa appear across sources, then creates unified representations:

  • Cell_123 → hasLocation → Ottawa
  • Cell_123 → usesTechnology → 5G
  • Cell_123 → hasBandwidth → 20MHz
  • Ottawa → hasPopulation → 1M

Queries can now traverse these connections: starting from a cell tower, follow edges to discover its technology, location, and demographic context—information that was previously siloed.
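The merging step described above can be sketched in a few lines of Python. This is a minimal illustration with invented field names and records, not the paper's actual pipeline: three views of the same cell tower are linked on their shared identifier and emitted as (subject, predicate, object) triples.

```python
# Three independent views of the same cell tower (illustrative data).
relational_row = {"cell_id": 123, "bandwidth": "20MHz", "location": "Ottawa"}
json_config = {"cell_id": 123, "technology": "5G"}
text_facts = {"Ottawa": {"hasPopulation": "1M"}}

def to_triples(rel, cfg, texts):
    """Link records on the shared cell_id and emit (subject, predicate, object) triples."""
    subject = f"Cell_{rel['cell_id']}"
    triples = [
        (subject, "hasLocation", rel["location"]),
        (subject, "hasBandwidth", rel["bandwidth"]),
    ]
    if cfg["cell_id"] == rel["cell_id"]:  # align entities on the shared key
        triples.append((subject, "usesTechnology", cfg["technology"]))
    for entity, props in texts.items():   # facts extracted from text documents
        for pred, obj in props.items():
            triples.append((entity, pred, obj))
    return triples

triples = to_triples(relational_row, json_config, text_facts)
for t in triples:
    print(t)
```

Once the triples exist in one structure, a query can start at Cell_123 and reach Ottawa's population in a single hop, which is exactly the traversal the fragmented sources could not support.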

RDF Triples and Ontology

Ericsogate uses the Resource Description Framework (RDF) to model knowledge. Each fact becomes a triple: <subject, predicate, object>. For example:

RDF Triple Examples

<TestCase_42, hasResult, "Failed">
<TestCase_42, executedOn, Server_A>
<Server_A, runsSoftware, Version_3.2.1>
<Version_3.2.1, releasedBy, Team_CloudRAN>

The Knowledge Graph consists of two layers: Instance Data (specific facts about entities) and Ontology (class hierarchies and relationship schemas). The ontology defines that "Functional Test" and "Performance Test" are subclasses of "Test Case," enabling queries like "find all test cases" to automatically include both types without explicit enumeration.
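The subclass behavior above can be made concrete with a small, self-contained sketch. The class names and triples are illustrative, not the paper's actual schema; the point is that a query for "Test Case" covers both subclasses by walking the subClassOf hierarchy rather than enumerating types.

```python
# Instance data: specific facts about entities (illustrative).
instance_data = {
    ("FuncTest_1", "type", "Functional Test"),
    ("PerfTest_1", "type", "Performance Test"),
    ("Server_A", "type", "Server"),
}

# Ontology: class hierarchy (illustrative).
ontology = {
    ("Functional Test", "subClassOf", "Test Case"),
    ("Performance Test", "subClassOf", "Test Case"),
}

def subclasses_of(cls, onto):
    """Transitive closure of subClassOf, including cls itself."""
    found, frontier = {cls}, {cls}
    while frontier:
        frontier = {s for (s, p, o) in onto if p == "subClassOf" and o in frontier} - found
        found |= frontier
    return found

def instances_of(cls, data, onto):
    """All instances whose type is cls or any subclass of cls."""
    classes = subclasses_of(cls, onto)
    return {s for (s, p, o) in data if p == "type" and o in classes}

# Both test instances are returned without naming their subclasses explicitly.
print(instances_of("Test Case", instance_data, ontology))
```

In a production triple store this closure is computed by the query engine (for example, via SPARQL property paths over rdfs:subClassOf) rather than by hand-written Python.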

Six-Layer Architecture

Each layer addresses specific requirements: data provenance, quality assurance, entity alignment, horizontal scalability, access control, and data freshness.

Ericsogate Architecture Diagram

Figure 1. The six-layer architecture showing data flow from ingestion through transformation, control, storage, API exposure, and application consumption.

Layer 1

Ingestion Layer

Interfaces with heterogeneous data sources via APIs, extracting entities and properties from structured, semi-structured, and unstructured content.

  • Raw Data Readers: Source-specific adapters for databases, APIs, log files
  • Parsers: Regular expressions to ML models for entity extraction
  • Data Cleaning: Duplicate removal, spelling correction, format normalization
  • Origin Trace: Metadata capture for provenance tracking
Layer 2

Transformation Layer

Converts heterogeneous formats into standardized JSON triples (subject-predicate-object), enriched with local ontologies defining source-specific class hierarchies.

  • JSON Transformers: Source-specific conversion to uniform triple format
  • Local Ontologies: Class hierarchy for each data source
  • Developer Interface: Standard JSON libraries, no RDF expertise required
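A Layer-2 transformer can be sketched as follows. The field-to-predicate mapping and record shape are assumptions for illustration; the relevant property is that the developer works only with standard JSON, with no RDF tooling required.

```python
import json

def transform_test_record(record: dict) -> list[dict]:
    """Map one relational test row to subject-predicate-object JSON triples."""
    subject = f"TestCase_{record['id']}"
    # Source-specific mapping from column names to predicates (illustrative).
    mapping = {"result": "hasResult", "server": "executedOn"}
    return [
        {"subject": subject, "predicate": pred, "object": record[field]}
        for field, pred in mapping.items() if field in record
    ]

row = {"id": 42, "result": "Failed", "server": "Server_A"}
triples = transform_test_record(row)
print(json.dumps(triples, indent=2))
```

The Control Layer's JSON-RDF transformer later converts these uniform JSON triples into RDF for storage, so this adapter never touches RDF directly.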
Layer 3

Control Layer

Orchestrates the data pipeline: schedules ingestion tasks, aligns entities across sources, and integrates local ontologies into a global schema.

  • Ingestion Registry: Tracks data sources and refresh rates
  • Scheduler: Automated refresh from milliseconds to days
  • Entity Alignment: Local and global matching using lexical and contextual similarity
  • JSON-RDF Transformer: Converts to RDF for storage
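The lexical half of entity alignment can be approximated with a string-similarity check. This sketch uses Python's standard-library SequenceMatcher; the 0.85 threshold is an assumption, and the paper's method also incorporates contextual similarity, which is omitted here.

```python
from difflib import SequenceMatcher

def lexical_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two labels as the same entity if their strings are close enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(lexical_match("Ottawa", "ottawa"))    # same entity, different casing
print(lexical_match("Ottawa", "Toronto"))   # distinct entities
```

When two labels match, the Control Layer would merge their nodes so that facts from both sources attach to a single entity in the global graph.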
Layer 4

Data Management Layer

Stores the Knowledge Graph using Apache Jena triple stores, distributed across multiple engines for horizontal scalability with federated SPARQL queries.

  • Triple Store: Open-source Apache Jena for RDF storage
  • DM Controller: Data distribution across graph engines
  • Federated Queries: Transparent querying across distributed stores
  • Single-Graph Facade: Applications see unified graph
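The single-graph facade can be illustrated with a toy pattern matcher that fans a query out to several stores and merges the results, so callers never see the distribution. Store contents are invented; in the real system this role is played by federated SPARQL queries over distributed Apache Jena engines.

```python
# Two physically separate stores holding parts of the same graph (illustrative).
store_1 = {("TestCase_42", "hasResult", "Failed")}
store_2 = {("TestCase_42", "executedOn", "Server_A")}

def federated_match(pattern, stores):
    """Match an (s, p, o) pattern (None = wildcard) across every store."""
    s, p, o = pattern
    return {
        t for store in stores for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }

# One query, both stores answered: the caller sees a unified graph.
print(federated_match(("TestCase_42", None, None), [store_1, store_2]))
```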
Layer 5

API Layer

Exposes REST APIs returning JSON, abstracting SPARQL complexity. Provides granular access control through credential-protected endpoints.

  • Endpoint Manager: Dynamic API endpoint generation
  • Request Handler: Translates REST to SPARQL
  • Access Control: Unique credentials per endpoint
  • JSON Response: Standard format for application layer
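The request handler's REST-to-SPARQL translation might look like the following sketch. The prefix, predicate names, and filter format are assumptions; the idea is that a caller supplies plain field filters and never writes SPARQL.

```python
def rest_to_sparql(filters: dict) -> str:
    """Build a SPARQL SELECT from REST-style field=value filters."""
    clauses = " ".join(f'?s :{pred} "{val}" .' for pred, val in filters.items())
    return f"SELECT ?s WHERE {{ {clauses} }}"

# e.g. GET /tests?hasResult=Failed&executedOn=Server_A
query = rest_to_sparql({"hasResult": "Failed", "executedOn": "Server_A"})
print(query)
```

The handler would execute this query against the federated stores and serialize the bindings back to JSON for the application layer.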
Layer 6

Application Layer

User-facing applications leveraging the Knowledge Graph: semantic search dashboards, summarization tools, recommendation systems, and error detection.

  • Semantic Search: Meaning-based queries across linked data
  • KG Summarization: 100K nodes to 100 insights
  • Recommendations: Related tests and configurations
  • Pattern Detection: Cross-failure analysis

Data Model and Requirements

Ericsogate addresses eight critical requirements for CloudRAN data management, each mapped to specific architectural components.

Requirements Mapping

R1 - Data Provenance: Ingestion Layer embeds origin metadata from source files, recording the complete data lineage for auditability and transparency.

R2 - Data Quality: Ingestion Layer enforces validation (format, type, value specifications) and cleaning (duplicates, irregularities, missing entries) before downstream processing.

R3 - Data Heterogeneity: Transformation Layer standardizes formats into JSON triples. Control Layer performs local matching (within sources) and global matching (across sources) using entity alignment algorithms.

R4 - Schema Flexibility: RDF's schema-later approach allows fluid data addition through triples without predefined schemas, enabling rapid integration of new data structures.

R5 - Scalability: Modular data readers allow new sources to be added incrementally (source scalability), and the DM Controller distributes data across multiple graph engines (horizontal scalability), so capacity grows with demand.

R6 - Security: API Layer generates unique, credential-protected endpoints per user or group, ensuring users access only authorized data subsets.

R7 - Data Freshness: Scheduler in Control Layer automatically requests updates from data sources at configured intervals (milliseconds to days), maintaining current state.

R8 - Cost-Effectiveness: Deployment on open-source Apache Jena eliminates licensing costs while meeting functional requirements.

Experimental Evaluation

Comparative analysis of Ericsogate against traditional data management systems at Ericsson.

Table 1. Performance comparison across key operational metrics

Metric                    | Traditional System | Ericsogate (KG)    | Improvement
Update Speed              | 20H hours          | H hours            | 20× faster
Feature Development       | F features/month   | 10F features/month | 10× throughput
Navigation Depth          | 2 hops             | Unbounded          | Unlimited traversal depth
Summarization Compression | ~80%               | 99.90%             | +19.9 percentage points

Note: H and F are undisclosed normalization constants used to protect proprietary performance baselines.

  • Compression Rate: 99.90%. Reduces 100,000 nodes to ~100 summary nodes while maintaining 94% representativeness.
  • Representativeness: 94%. Measured holistically across node count, relationship preservation, and stakeholder satisfaction.
  • Horizontal Scalability: Linear. Summary size grows linearly with class count, unaffected by entity volume per class.
  • Update Latency: 20× faster. Modifying search features reduced from 20H to H hours through automated query generation.

Key Finding: Democratized Development

A critical outcome is the reduction in Knowledge Graph expertise requirements. Only developers working on the Control and API layers require SPARQL knowledge. Data ingestion developers work with standard JSON, application developers consume REST APIs—dramatically reducing training overhead and enabling faster team onboarding.

Industrial Applications

Application 1

Knowledge Graph Summarization

Problem

Stakeholders needed to understand test execution patterns across 100,000+ nodes representing test cases, runs, software versions, and configurations. Traditional grouping methods obscured critical distinctions, bit-level compression optimized storage without facilitating analysis, and simplification techniques risked omitting important patterns.

KG Solution

Pattern-Statistics-based Method leverages the ontology structure to abstract instances by their class types. Instead of showing TestCase_1, TestCase_2, TestCase_3 individually, the system displays "TestCase" class with aggregated statistics: "3 instances, 2 functional, 1 performance, executed 4 times total." With class hierarchies, stakeholders see test type distributions, execution frequencies, and failure concentrations without raw data overload.
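The class-level abstraction described above can be sketched with a small aggregator. The instance records mirror the example in the text (3 test cases, 4 executions total) but are otherwise invented; the real method operates over the ontology's class hierarchy.

```python
instances = [
    {"id": "TestCase_1", "class": "Functional Test", "executions": 2},
    {"id": "TestCase_2", "class": "Functional Test", "executions": 1},
    {"id": "TestCase_3", "class": "Performance Test", "executions": 1},
]

def summarize(insts):
    """Collapse instances into one summary node per class with aggregate statistics."""
    summary = {}
    for inst in insts:
        node = summary.setdefault(inst["class"], {"instances": 0, "executions": 0})
        node["instances"] += 1
        node["executions"] += inst["executions"]
    return summary

# Three instance nodes collapse into two class-level summary nodes.
print(summarize(instances))
```

Because the summary has one node per class rather than per instance, its size grows with the number of classes, which is why the compression scales linearly regardless of entity volume.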

Outcome

Achieved 99.90% compression (100K to ~100 nodes) while maintaining 94% representativeness. Stakeholders gained immediate pattern visibility: which test types dominate, execution frequency distributions, and failure concentration areas—enabling informed strategic decisions without manual analysis.

Application 2

Semantic Search Dashboard

Problem

Engineers investigating test failures needed to manually query multiple disconnected systems: test case definitions in one database, execution results in another, software versions in a third, deployment logs in a fourth. This fragmentation transformed routine diagnoses into multi-hour investigations requiring specialized knowledge of each system's query interface.

KG Solution

Semantic Search enables meaning-based queries across linked data. Users filter by fields from multiple sources through a unified dashboard. Clicking a failed test run triggers traversal through the Knowledge Graph, instantly retrieving the linked software version, engineer who executed it, hardware configuration, and related historical failures—data previously requiring correlation across five independent queries.
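The single-click traversal can be illustrated with a toy graph walk. The graph contents and predicate names here are invented; the point is that one pass over linked edges gathers context that previously required several independent queries against separate systems.

```python
# A small slice of the Knowledge Graph around one failed test run (illustrative).
graph = {
    ("Run_7", "ofTest", "TestCase_42"),
    ("Run_7", "hasResult", "Failed"),
    ("Run_7", "executedBy", "Engineer_X"),
    ("Run_7", "onServer", "Server_A"),
    ("Server_A", "runsSoftware", "Version_3.2.1"),
}

def neighbors(node, g):
    """All outgoing (predicate, object) edges of a node."""
    return {(p, o) for (s, p, o) in g if s == node}

def investigate(run, g):
    """One traversal gathers what previously took several independent queries."""
    context = dict(neighbors(run, g))
    server = context.get("onServer")
    if server:
        context.update(dict(neighbors(server, g)))  # follow one more hop
    return context

print(investigate("Run_7", graph))
```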

Outcome

Investigation time reduced from hours to seconds. Feature development accelerated 10× (F to 10F features/month) due to unified data access. Engineers now perform unlimited-depth navigation compared to the previous 2-hop limitation, enabling discovery of non-obvious relationships between test configurations and failure modes.

Citation

To cite the Ericsogate framework, methodology, or applications in research, use the following entry.

BibTeX Entry
Ericsogate: Advancing Analytics and Management of Data from Diverse Sources within Ericsson Using Knowledge Graphs
@inproceedings{orogat2024ericsogate,
  author    = {Orogat, Abdelghny and Vadlamani, Sri Lakshmi and Thomas, Dimple and El-Roby, Ahmed},
  title     = {Ericsogate: Advancing Analytics and Management of Data from Diverse Sources within Ericsson Using Knowledge Graphs},
  year      = {2024},
  booktitle = {Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM '24)},
  doi       = {10.1145/3627673.3680033},
  location  = {Boise, ID, USA},
  publisher = {ACM}
}