Version: 3.0

1.3 Data Engineering & Data Integration

Data Engineering and Data Integration in Data Context Hub transform incoming data from diverse sources into a unified knowledge graph. This process involves three key stages:

  1. Data Ingestion: Data is collected from various sources via Intake Agents
  2. Staging Layer: Incoming data is transformed and modeled using Target Entities and Relationships
  3. Graph Processing: The staged data is processed into a queryable knowledge graph stored in Neo4j

Staging Layer

The Staging Layer serves as an intermediate data transformation environment where raw data from various sources is standardized and prepared for knowledge graph generation. It consists of three main components:

  • Intake Agents: Import data from external sources
  • Target Entities: Define the structure and schema for data objects
  • Relationships: Establish connections between different Target Entities

The Staging Layer provides a flexible environment where data can be shaped, validated, and enriched before being transformed into the final knowledge graph structure.

Intake Agents

Intake Agents are standalone applications that bridge external data sources with Data Context Hub. They act as specialized connectors that:

  • Connect to various data sources (databases, file systems, APIs, etc.)
  • Extract and retrieve data from these sources
  • Handle data transformation and formatting
  • Deliver prepared data to Data Context Hub for import operations

Each Intake Agent operates as an independent service, enabling seamless integration without tight coupling to the Data Context Hub core system. For implementation details and development guidance, refer to Intake Agents.
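The connector responsibilities above can be sketched as a minimal agent. This is an illustrative outline only: the class name, method names, and payload shape are assumptions, not part of the actual Intake Agent API.

```python
import json

class CsvIntakeAgent:
    """Hypothetical minimal Intake Agent: extracts records from a source,
    normalizes them, and prepares a payload for import into Data Context Hub.
    All names here are illustrative, not the real SDK."""

    def __init__(self, source_rows):
        # Stand-in for a connection to a database, file system, or API.
        self.source_rows = source_rows

    def extract(self):
        # Retrieve raw records from the external source.
        return list(self.source_rows)

    def transform(self, record):
        # Normalize field names and formats expected by the Staging Layer.
        return {"business_key": record["id"], "name": record["name"].strip()}

    def deliver(self):
        # Serialize the prepared records for the import operation.
        return json.dumps([self.transform(r) for r in self.extract()])

agent = CsvIntakeAgent([{"id": "M-001", "name": " Milling Machine "}])
payload = agent.deliver()
```

Because each agent runs as its own service, the core system only ever sees the delivered payload, never the source connection itself.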

Target Entities

A Target Entity represents a real-world object or concept within the Data Context Hub model, such as:

  • Physical objects (devices, machines, products)
  • People (employees, customers, contacts)
  • Abstract concepts (projects, documents, transactions)

Key Characteristics

  • Properties: Each entity has attributes that describe it (e.g., name, type, description, status)
  • Business Key: One property must be designated as the Business Key, a value that is unique per instance and identifies it across processing runs
  • Table Representation: In the Staging Layer, each Target Entity is represented as a table
    • Each row represents a single instance of the entity
    • Each column represents a property or attribute
    • During processing, rows are transformed into nodes in the knowledge graph

Target Entities form the foundation of the knowledge graph data model, defining what types of information will be stored and how they are structured.
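The row-to-node mapping can be sketched as follows. The "Device" entity, its property names, and the `rows_to_nodes` helper are hypothetical examples, assuming `serial_no` has been designated as the Business Key.

```python
# Hypothetical "Device" Target Entity table: each row is one instance,
# each column is a property, and "serial_no" is the unique Business Key.
device_rows = [
    {"serial_no": "DEV-001", "name": "Press A", "status": "active"},
    {"serial_no": "DEV-002", "name": "Press B", "status": "inactive"},
]

def rows_to_nodes(rows, label, business_key):
    # During Graph Processing each row becomes one node; the Business Key
    # identifies that node uniquely within its label.
    nodes = {}
    for row in rows:
        nodes[row[business_key]] = {"label": label, "properties": dict(row)}
    return nodes

nodes = rows_to_nodes(device_rows, "Device", "serial_no")
```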

Relationships

Relationships define connections between Target Entities, establishing how different data objects relate to one another. They enable the creation of a meaningful network of interconnected data.

Relationship Configuration

Each relationship definition includes:

  • From Entity: The source Target Entity
  • To Entity: The destination Target Entity (can be the same as From Entity for self-referencing relationships)
  • Matching Properties: Properties from both entities that are compared to determine if a relationship exists
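A relationship definition with these three parts might look like the following sketch. The field names and the Device/Employee entities are assumptions for illustration; the actual configuration format may differ.

```python
# Hypothetical relationship definition: connect each Device to the
# Employee responsible for it by matching Device.owner_id against
# Employee.employee_id. Field names are illustrative.
relationship_def = {
    "from_entity": "Device",
    "to_entity": "Employee",  # may equal from_entity for self-references
    "matching_properties": [
        {"from_property": "owner_id", "to_property": "employee_id"},
    ],
}
```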

Relationship Evaluation

Relationships are evaluated during the data loading process:

  1. Incoming data arrives and is loaded into Target Entity tables in the Staging Layer
  2. Data Context Hub compares property values based on relationship definitions
  3. When matching conditions are met, relationship objects are created in the Staging Layer
  4. During Graph Processing, these relationship objects become edges connecting nodes in the knowledge graph
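Steps 2 and 3 above can be sketched as a property comparison over two staged tables. The function signature and sample data are illustrative, not the actual evaluation engine.

```python
def evaluate_relationships(from_rows, to_rows, matching):
    """Sketch of relationship evaluation: compare property values from
    two Target Entity tables and emit a relationship object per match."""
    rels = []
    for f in from_rows:
        for t in to_rows:
            # A relationship exists when every matching-property pair agrees.
            if all(f[m["from_property"]] == t[m["to_property"]] for m in matching):
                rels.append({"from": f, "to": t})
    return rels

devices = [{"serial_no": "DEV-001", "owner_id": "E-7"}]
employees = [{"employee_id": "E-7", "name": "Ada"},
             {"employee_id": "E-9", "name": "Ben"}]
matching = [{"from_property": "owner_id", "to_property": "employee_id"}]
edges = evaluate_relationships(devices, employees, matching)
```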

Knowledge Graph Processing

Processing is the final stage of the data integration workflow where data from the Staging Layer is transformed into a knowledge graph. This process converts Target Entity rows into nodes and Relationship objects into edges, then stores them in the Neo4j graph database.

Data Context Hub supports multiple processing strategies, each optimized for different use cases and requirements. Knowledge Graph Processing follows a consistent set of steps regardless of the strategy employed:

  1. Extract: Rows from Target Entity tables are retrieved from the Staging Layer
  2. Transform: Entity rows are converted into graph nodes with their properties
  3. Connect: Relationship objects are transformed into edges connecting nodes
  4. Store: The graph structure is written to Neo4j

The choice of processing strategy determines how these steps are applied: to the entire dataset at once, or incrementally.
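The four shared steps can be sketched end to end with in-memory structures standing in for the Staging Layer and Neo4j. All names here are assumptions for illustration.

```python
def process_graph(entity_tables, relationship_objects):
    """Sketch of the shared pipeline: extract -> transform -> connect -> store,
    with plain lists in place of the Staging Layer and Neo4j."""
    # Extract + Transform: each Target Entity row becomes a node.
    nodes = [
        {"label": label, "properties": row}
        for label, rows in entity_tables.items()
        for row in rows
    ]
    # Connect: relationship objects become edges between nodes.
    edges = [{"type": r["type"], "from": r["from"], "to": r["to"]}
             for r in relationship_objects]
    # Store: the real system writes this structure to Neo4j;
    # here we simply return it.
    return {"nodes": nodes, "edges": edges}

graph = process_graph(
    {"Device": [{"serial_no": "DEV-001"}]},
    [{"type": "OWNED_BY", "from": "DEV-001", "to": "E-7"}],
)
```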

Clear & Process

The Clear & Process strategy provides a complete refresh of the knowledge graph by removing all existing nodes and edges before generating new ones from the Staging Layer.

When to Use

This strategy is ideal when:

  • The data volume is manageable and processing time is acceptable
  • A clean slate is needed to eliminate outdated or orphaned data
  • Complete consistency with the source data is critical
  • The graph structure or model has changed significantly

Process Flow

  1. Clear: All existing nodes and edges are removed from the graph database
  2. Extract: All rows from Target Entity tables are retrieved
  3. Transform: Entity rows are converted into nodes, Relationship objects into edges
  4. Load: The complete new graph structure is written to Neo4j

This approach ensures the knowledge graph is an exact representation of the current Staging Layer data, with no remnants of previous processing runs.
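In Cypher terms, the Clear step corresponds to deleting every node together with its edges before recreating the graph. The statements and the `run` callback below are a sketch of what a driver session might issue, not the statements Data Context Hub actually uses.

```python
# Illustrative Cypher for a full refresh (assumed, not the product's exact queries).
CLEAR = "MATCH (n) DETACH DELETE n"          # remove all nodes and edges
CREATE_NODE = "CREATE (:Device {serial_no: $serial_no, name: $name})"

def clear_and_process(run, rows):
    # `run` stands in for a Neo4j session's query runner.
    run(CLEAR)                               # 1. Clear
    for row in rows:                         # 2-4. Extract, Transform, Load
        run(CREATE_NODE, **row)

issued = []
clear_and_process(lambda q, **p: issued.append((q, p)),
                  [{"serial_no": "DEV-001", "name": "Press A"}])
```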

Additive Processing

Additive Processing, also known as incremental processing, updates the existing knowledge graph by synchronizing it with the current state of the Staging Layer without clearing existing data.

When to Use

This strategy is optimal when:

  • The knowledge graph is large and complete regeneration would be time-consuming
  • System resources need to be conserved
  • Continuous availability of the graph is required
  • Changes to the data are incremental rather than wholesale

Process Flow

  1. Compare: The current Staging Layer data is compared with the existing graph
  2. Update: Nodes that already exist in the graph are updated with new property values
  3. Create: New nodes from the Staging Layer that don't exist in the graph are added
  4. Delete: Nodes that no longer exist in the Staging Layer are removed from the graph
  5. Sync Relationships: Edges are updated to reflect current Relationship objects

This approach minimizes processing time and system load by only modifying what has changed, making it efficient for large-scale knowledge graphs.
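The update/create/delete decisions above amount to a diff keyed on the Business Key. The following is a minimal sketch under that assumption; the function and data are illustrative.

```python
def additive_sync(existing, staged, business_key):
    """Sketch of Additive Processing: diff staged rows against existing
    graph nodes by Business Key and return the change sets to apply."""
    existing_by_key = {n[business_key]: n for n in existing}
    staged_by_key = {r[business_key]: r for r in staged}
    # Create: staged rows with no matching node in the graph.
    to_create = [r for k, r in staged_by_key.items() if k not in existing_by_key]
    # Update: rows whose node exists but whose properties changed.
    to_update = [r for k, r in staged_by_key.items()
                 if k in existing_by_key and existing_by_key[k] != r]
    # Delete: nodes with no matching row left in the Staging Layer.
    to_delete = [n for k, n in existing_by_key.items() if k not in staged_by_key]
    return to_create, to_update, to_delete

existing = [{"serial_no": "DEV-001", "status": "active"},
            {"serial_no": "DEV-002", "status": "active"}]
staged = [{"serial_no": "DEV-001", "status": "inactive"},
          {"serial_no": "DEV-003", "status": "active"}]
created, updated, deleted = additive_sync(existing, staged, "serial_no")
```

Only the three change sets touch the graph, which is what keeps this strategy cheap for large graphs.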

Accessing the Knowledge Graph

After processing completes, the knowledge graph becomes available for:

  • Exploration: Navigate the graph visually using the Explorer interface to discover connections and patterns
  • Querying: Execute custom Cypher queries through projections to extract specific insights
  • Integration: Access graph data through Data Context Hub APIs for custom applications and dashboards

The knowledge graph provides a powerful, flexible way to discover insights, understand relationships, and analyze complex interconnected data across your organization.