Constructing Knowledge Graphs from Heterogeneous Data Sources: Use Case in Ericsson

A modular, six-layer framework for managing interlinked CloudRAN data using Knowledge Graphs, enabling efficient analytics and decision-making in telecommunications.

Abdelghny Orogat, Sri Lakshmi Vadlamani, Dimple Thomas, Ahmed El-Roby

Carleton University, Ottawa · Telefonaktiebolaget LM Ericsson, Stockholm

Publication

Conference CIKM '24
Location Boise, ID, USA
Publisher ACM

The Data Integration Challenge

In Cloud Radio Access Networks (CloudRAN), data originates from numerous heterogeneous sources: relational databases storing test configurations, JSON files containing execution results, text logs documenting system behavior, and metrics tracking operational performance. This diversity creates a fundamental challenge for telecommunications companies like Ericsson.

When engineers need to diagnose a test failure, they must manually traverse multiple disconnected systems. A test case resides in one database, its execution results in another, software version details in a third system, and deployment logs in yet another location. This fragmentation transforms routine investigations into hours-long data archaeology expeditions.

The Ericsogate Solution

Ericsogate addresses this problem by constructing a unified Knowledge Graph—a semantic network where entities from disparate sources are automatically identified, linked, and made queryable through their relationships.

What previously required manual correlation across five systems now resolves through a single traversal: click a failed test, immediately access its linked software version, the engineer who executed it, the hardware configuration, and related historical failures.

Knowledge Graph Fundamentals

A Knowledge Graph represents information as a network of entities (nodes) and their relationships (edges). Unlike traditional databases that store isolated records, Knowledge Graphs explicitly encode semantic connections.

Construction from Heterogeneous Sources

Consider three independent data sources at Ericsson:

  • Relational Database: Cell tower metadata (cell_id: 123, bandwidth: 20MHz, location: Ottawa)
  • JSON Configuration: Technical specifications (cell_id: 123, technology: 5G)
  • Text Documents: Geographic information ("Ottawa, population 1 million...")

Ericsogate identifies that cell_id 123 and Ottawa appear across sources, then creates unified representations:

  • Cell_123 → hasLocation → Ottawa
  • Cell_123 → usesTechnology → 5G
  • Cell_123 → hasBandwidth → 20MHz
  • Ottawa → hasPopulation → 1M

Queries can now traverse these connections: starting from a cell tower, follow edges to discover its technology, location, and demographic context—information that was previously siloed.
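The merging step described above can be sketched in a few lines of Python. This is a minimal illustration with invented field names and records, not the paper's actual pipeline: three views of the same cell tower are linked on their shared identifier and emitted as (subject, predicate, object) triples.

```python
# Three independent views of the same cell tower (illustrative data).
relational_row = {"cell_id": 123, "bandwidth": "20MHz", "location": "Ottawa"}
json_config = {"cell_id": 123, "technology": "5G"}
text_facts = {"Ottawa": {"hasPopulation": "1M"}}

def to_triples(rel, cfg, texts):
    """Link records on the shared cell_id and emit (subject, predicate, object) triples."""
    subject = f"Cell_{rel['cell_id']}"
    triples = [
        (subject, "hasLocation", rel["location"]),
        (subject, "hasBandwidth", rel["bandwidth"]),
    ]
    if cfg["cell_id"] == rel["cell_id"]:  # align entities on the shared key
        triples.append((subject, "usesTechnology", cfg["technology"]))
    for entity, props in texts.items():   # facts extracted from text documents
        for pred, obj in props.items():
            triples.append((entity, pred, obj))
    return triples

triples = to_triples(relational_row, json_config, text_facts)
for t in triples:
    print(t)
```

Once the triples exist in one structure, a query can start at Cell_123 and reach Ottawa's population in a single hop, which is exactly the traversal the fragmented sources could not support.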

RDF Triples and Ontology

Ericsogate uses the Resource Description Framework (RDF) to model knowledge. Each fact becomes a triple: <subject, predicate, object>. For example:

RDF Triple Examples

<TestCase_42, hasResult, "Failed">
<TestCase_42, executedOn, Server_A>
<Server_A, runsSoftware, Version_3.2.1>
<Version_3.2.1, releasedBy, Team_CloudRAN>

The Knowledge Graph consists of two layers: Instance Data (specific facts about entities) and Ontology (class hierarchies and relationship schemas). The ontology defines that "Functional Test" and "Performance Test" are subclasses of "Test Case," enabling queries like "find all test cases" to automatically include both types without explicit enumeration.
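The subclass behavior above can be made concrete with a small, self-contained sketch. The class names and triples are illustrative, not the paper's actual schema; the point is that a query for "Test Case" covers both subclasses by walking the subClassOf hierarchy rather than enumerating types.

```python
# Instance data: specific facts about entities (illustrative).
instance_data = {
    ("FuncTest_1", "type", "Functional Test"),
    ("PerfTest_1", "type", "Performance Test"),
    ("Server_A", "type", "Server"),
}

# Ontology: class hierarchy (illustrative).
ontology = {
    ("Functional Test", "subClassOf", "Test Case"),
    ("Performance Test", "subClassOf", "Test Case"),
}

def subclasses_of(cls, onto):
    """Transitive closure of subClassOf, including cls itself."""
    found, frontier = {cls}, {cls}
    while frontier:
        frontier = {s for (s, p, o) in onto if p == "subClassOf" and o in frontier} - found
        found |= frontier
    return found

def instances_of(cls, data, onto):
    """All instances whose type is cls or any subclass of cls."""
    classes = subclasses_of(cls, onto)
    return {s for (s, p, o) in data if p == "type" and o in classes}

# Both test instances are returned without naming their subclasses explicitly.
print(instances_of("Test Case", instance_data, ontology))
```

In a production triple store this closure is computed by the query engine (for example, via SPARQL property paths over rdfs:subClassOf) rather than by hand-written Python.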

Six-Layer Architecture

Each layer addresses specific requirements: data provenance, quality assurance, entity alignment, horizontal scalability, access control, and data freshness.

Ericsogate Architecture Diagram

Figure 1. The six-layer architecture showing data flow from ingestion through transformation, control, storage, API exposure, and application consumption.

Layer 1

Ingestion Layer

Interfaces with heterogeneous data sources via APIs, extracting entities and properties from structured, semi-structured, and unstructured content.

  • Raw Data Readers: Source-specific adapters for databases, APIs, log files
  • Parsers: Regular expressions to ML models for entity extraction
  • Data Cleaning: Duplicate removal, spelling correction, format normalization
  • Origin Trace: Metadata capture for provenance tracking
Layer 2

Transformation Layer

Converts heterogeneous formats into standardized JSON triples (subject-predicate-object), enriched with local ontologies defining source-specific class hierarchies.

  • JSON Transformers: Source-specific conversion to uniform triple format
  • Local Ontologies: Class hierarchy for each data source
  • Developer Interface: Standard JSON libraries, no RDF expertise required
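A Layer-2 transformer can be sketched as follows. The field-to-predicate mapping and record shape are assumptions for illustration; the relevant property is that the developer works only with standard JSON, with no RDF tooling required.

```python
import json

def transform_test_record(record: dict) -> list[dict]:
    """Map one relational test row to subject-predicate-object JSON triples."""
    subject = f"TestCase_{record['id']}"
    # Source-specific mapping from column names to predicates (illustrative).
    mapping = {"result": "hasResult", "server": "executedOn"}
    return [
        {"subject": subject, "predicate": pred, "object": record[field]}
        for field, pred in mapping.items() if field in record
    ]

row = {"id": 42, "result": "Failed", "server": "Server_A"}
triples = transform_test_record(row)
print(json.dumps(triples, indent=2))
```

The Control Layer's JSON-RDF transformer later converts these uniform JSON triples into RDF for storage, so this adapter never touches RDF directly.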
Layer 3

Control Layer

Orchestrates the data pipeline: schedules ingestion tasks, aligns entities across sources, and integrates local ontologies into a global schema.

  • Ingestion Registry: Tracks data sources and refresh rates
  • Scheduler: Automated refresh from milliseconds to days
  • Entity Alignment: Local and global matching using lexical and contextual similarity
  • JSON-RDF Transformer: Converts to RDF for storage
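The lexical half of entity alignment can be approximated with a string-similarity check. This sketch uses Python's standard-library SequenceMatcher; the 0.85 threshold is an assumption, and the paper's method also incorporates contextual similarity, which is omitted here.

```python
from difflib import SequenceMatcher

def lexical_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two labels as the same entity if their strings are close enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(lexical_match("Ottawa", "ottawa"))    # same entity, different casing
print(lexical_match("Ottawa", "Toronto"))   # distinct entities
```

When two labels match, the Control Layer would merge their nodes so that facts from both sources attach to a single entity in the global graph.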
Layer 4

Data Management Layer

Stores the Knowledge Graph using Apache Jena triple stores, distributed across multiple engines for horizontal scalability with federated SPARQL queries.

  • Triple Store: Open-source Apache Jena for RDF storage
  • DM Controller: Data distribution across graph engines
  • Federated Queries: Transparent querying across distributed stores
  • Single-Graph Facade: Applications see unified graph
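The single-graph facade can be illustrated with a toy pattern matcher that fans a query out to several stores and merges the results, so callers never see the distribution. Store contents are invented; in the real system this role is played by federated SPARQL queries over distributed Apache Jena engines.

```python
# Two physically separate stores holding parts of the same graph (illustrative).
store_1 = {("TestCase_42", "hasResult", "Failed")}
store_2 = {("TestCase_42", "executedOn", "Server_A")}

def federated_match(pattern, stores):
    """Match an (s, p, o) pattern (None = wildcard) across every store."""
    s, p, o = pattern
    return {
        t for store in stores for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }

# One query, both stores answered: the caller sees a unified graph.
print(federated_match(("TestCase_42", None, None), [store_1, store_2]))
```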
Layer 5

API Layer

Exposes REST APIs returning JSON, abstracting SPARQL complexity. Provides granular access control through credential-protected endpoints.

  • Endpoint Manager: Dynamic API endpoint generation
  • Request Handler: Translates REST to SPARQL
  • Access Control: Unique credentials per endpoint
  • JSON Response: Standard format for application layer
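The request handler's REST-to-SPARQL translation might look like the following sketch. The prefix, predicate names, and filter format are assumptions; the idea is that a caller supplies plain field filters and never writes SPARQL.

```python
def rest_to_sparql(filters: dict) -> str:
    """Build a SPARQL SELECT from REST-style field=value filters."""
    clauses = " ".join(f'?s :{pred} "{val}" .' for pred, val in filters.items())
    return f"SELECT ?s WHERE {{ {clauses} }}"

# e.g. GET /tests?hasResult=Failed&executedOn=Server_A
query = rest_to_sparql({"hasResult": "Failed", "executedOn": "Server_A"})
print(query)
```

The handler would execute this query against the federated stores and serialize the bindings back to JSON for the application layer.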
Layer 6

Application Layer

User-facing applications leveraging the Knowledge Graph: semantic search dashboards, summarization tools, recommendation systems, and error detection.

  • Semantic Search: Meaning-based queries across linked data
  • KG Summarization: 100K nodes to 100 insights
  • Recommendations: Related tests and configurations
  • Pattern Detection: Cross-failure analysis

Data Model and Requirements

Ericsogate addresses eight critical requirements for CloudRAN data management, each mapped to specific architectural components.

Requirements Mapping

R1 - Data Provenance: Ingestion Layer embeds origin metadata from source files, recording the complete data lineage for auditability and transparency.

R2 - Data Quality: Ingestion Layer enforces validation (format, type, value specifications) and cleaning (duplicates, irregularities, missing entries) before downstream processing.

R3 - Data Heterogeneity: Transformation Layer standardizes formats into JSON triples. Control Layer performs local matching (within sources) and global matching (across sources) using entity alignment algorithms.

R4 - Schema Flexibility: RDF's schema-later approach allows fluid data addition through triples without predefined schemas, enabling rapid integration of new data structures.

R5 - Scalability: Modular data readers allow new sources to be added incrementally (source scalability), and the DM Controller distributes data across multiple graph engines (horizontal scalability), so capacity grows with demand.

R6 - Security: API Layer generates unique, credential-protected endpoints per user or group, ensuring users access only authorized data subsets.

R7 - Data Freshness: Scheduler in Control Layer automatically requests updates from data sources at configured intervals (milliseconds to days), maintaining current state.

R8 - Cost-Effectiveness: Deployment on open-source Apache Jena eliminates licensing costs while meeting functional requirements.

Experimental Evaluation

Comparative analysis of Ericsogate against traditional data management systems at Ericsson.

Table 1. Performance comparison across key operational metrics

Metric                    | Traditional System | Ericsogate (KG)    | Improvement
Update Speed              | 20H hours          | H hours            | 20× faster
Feature Development       | F features/month   | 10F features/month | 10× throughput
Navigation Depth          | 2 hops             | Unbounded          | Unlimited traversal depth
Summarization Compression | ~80%               | 99.90%             | +19.9 percentage points

Note: H and F are undisclosed normalization constants used to protect proprietary performance baselines.

  • Compression Rate: 99.90%. Reduces 100,000 nodes to ~100 summary nodes while maintaining 94% representativeness.
  • Representativeness: 94%. Measured holistically across node count, relationship preservation, and stakeholder satisfaction.
  • Horizontal Scalability: Linear. Summary size grows linearly with class count, unaffected by entity volume per class.
  • Update Latency: 20× faster. Modifying search features reduced from 20H to H hours through automated query generation.

Key Finding: Democratized Development

A critical outcome is the reduction in Knowledge Graph expertise requirements. Only developers working on the Control and API layers require SPARQL knowledge. Data ingestion developers work with standard JSON, application developers consume REST APIs—dramatically reducing training overhead and enabling faster team onboarding.

Industrial Applications

Application 1

Knowledge Graph Summarization

Problem

Stakeholders needed to understand test execution patterns across 100,000+ nodes representing test cases, runs, software versions, and configurations. Traditional grouping methods obscured critical distinctions, bit-level compression optimized storage without facilitating analysis, and simplification techniques risked omitting important patterns.

KG Solution

Pattern-Statistics-based Method leverages the ontology structure to abstract instances by their class types. Instead of showing TestCase_1, TestCase_2, TestCase_3 individually, the system displays "TestCase" class with aggregated statistics: "3 instances, 2 functional, 1 performance, executed 4 times total." With class hierarchies, stakeholders see test type distributions, execution frequencies, and failure concentrations without raw data overload.
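The class-level abstraction described above can be sketched with a small aggregator. The instance records mirror the example in the text (3 test cases, 4 executions total) but are otherwise invented; the real method operates over the ontology's class hierarchy.

```python
instances = [
    {"id": "TestCase_1", "class": "Functional Test", "executions": 2},
    {"id": "TestCase_2", "class": "Functional Test", "executions": 1},
    {"id": "TestCase_3", "class": "Performance Test", "executions": 1},
]

def summarize(insts):
    """Collapse instances into one summary node per class with aggregate statistics."""
    summary = {}
    for inst in insts:
        node = summary.setdefault(inst["class"], {"instances": 0, "executions": 0})
        node["instances"] += 1
        node["executions"] += inst["executions"]
    return summary

# Three instance nodes collapse into two class-level summary nodes.
print(summarize(instances))
```

Because the summary has one node per class rather than per instance, its size grows with the number of classes, which is why the compression scales linearly regardless of entity volume.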

Outcome

Achieved 99.90% compression (100K to ~100 nodes) while maintaining 94% representativeness. Stakeholders gained immediate pattern visibility: which test types dominate, execution frequency distributions, and failure concentration areas—enabling informed strategic decisions without manual analysis.

Application 2

Semantic Search Dashboard

Problem

Engineers investigating test failures needed to manually query multiple disconnected systems: test case definitions in one database, execution results in another, software versions in a third, deployment logs in a fourth. This fragmentation transformed routine diagnoses into multi-hour investigations requiring specialized knowledge of each system's query interface.

KG Solution

Semantic Search enables meaning-based queries across linked data. Users filter by fields from multiple sources through a unified dashboard. Clicking a failed test run triggers traversal through the Knowledge Graph, instantly retrieving the linked software version, engineer who executed it, hardware configuration, and related historical failures—data previously requiring correlation across five independent queries.
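The single-click traversal can be illustrated with a toy graph walk. The graph contents and predicate names here are invented; the point is that one pass over linked edges gathers context that previously required several independent queries against separate systems.

```python
# A small slice of the Knowledge Graph around one failed test run (illustrative).
graph = {
    ("Run_7", "ofTest", "TestCase_42"),
    ("Run_7", "hasResult", "Failed"),
    ("Run_7", "executedBy", "Engineer_X"),
    ("Run_7", "onServer", "Server_A"),
    ("Server_A", "runsSoftware", "Version_3.2.1"),
}

def neighbors(node, g):
    """All outgoing (predicate, object) edges of a node."""
    return {(p, o) for (s, p, o) in g if s == node}

def investigate(run, g):
    """One traversal gathers what previously took several independent queries."""
    context = dict(neighbors(run, g))
    server = context.get("onServer")
    if server:
        context.update(dict(neighbors(server, g)))  # follow one more hop
    return context

print(investigate("Run_7", graph))
```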

Outcome

Investigation time reduced from hours to seconds. Feature development accelerated 10× (F to 10F features/month) due to unified data access. Engineers now perform unlimited-depth navigation compared to the previous 2-hop limitation, enabling discovery of non-obvious relationships between test configurations and failure modes.

Citation

To cite the Ericsogate framework, methodology, or applications in research, use the following entry.

BibTeX Entry
Ericsogate: Advancing Analytics and Management of Data from Diverse Sources within Ericsson Using Knowledge Graphs
@inproceedings{orogat2024ericsogate,
  author    = {Orogat, Abdelghny and Vadlamani, Sri Lakshmi and Thomas, Dimple and El-Roby, Ahmed},
  title     = {Ericsogate: Advancing Analytics and Management of Data from Diverse Sources within Ericsson Using Knowledge Graphs},
  year      = {2024},
  booktitle = {Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM '24)},
  doi       = {10.1145/3627673.3680033},
  location  = {Boise, ID, USA},
  publisher = {ACM}
}