5.1 Data Pump development
General intended use of Data Pumps
Data Pumps are designed to load data into the Data Context Hub. The data source can be a database, file system, or any other source accessible from the Data Context Hub environment. Loading data with a Data Pump is a long-running process, and Data Pump jobs are executed one at a time on the job queue scheduler. While a Data Pump job is running on the job queue scheduler, no other job can be processed in parallel.
While Data Pump jobs are processed, the environment imposes no resource restrictions (e.g. memory, disk space …). In other words, the Data Pump itself must make sure that it does not exceed the resources of the worker's environment.
To optimize the workload, you can set up several containers and split the import across them instead of running one single import (if possible).
Requirements
Development environment
- Language: C#
- Application output type: Class Library (.dll)
- Target Framework: .NET 8.0
- No target operating system
Dependencies
The following dependencies are required for developing a Data Pump and are provided as NuGet packages through GitLab. In order to access the packages an Access Token is required.
- Explore.DataPumps: Base implementation of DataPumps
- Explore.Common.SharedResources: Common shared resources (e.g. DataPumpSourceMap, DataPumpArtifact etc.)
These two NuGet packages are also part of GBS and don't need to be uploaded!
Packages can be downloaded with the following commands:
# Get available versions
curl 'https://gitlab.c64.ai/api/v4/projects/6/packages/nuget/download/<package-name>/index' \
--header 'PRIVATE-TOKEN: <access-token>'
# Download package
curl 'https://gitlab.c64.ai/api/v4/projects/6/packages/nuget/download/<package-name>/<version>/<filename.version.nupkg>' \
--header 'PRIVATE-TOKEN: <access-token>'
The filename for a package is its lower-case variant, e.g. the filename for Explore.DataPumps is explore.datapumps.
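The naming rule above can be sketched as a small helper that assembles the download URL. This is only an illustration: the class and method names are hypothetical, and it assumes that the <package-name> path segment is passed through unchanged while only the file name is lower-cased.

```csharp
using System;

public static class PackageUrls
{
    // Base URL copied from the curl commands above.
    private const string BaseUrl =
        "https://gitlab.c64.ai/api/v4/projects/6/packages/nuget/download";

    // File name rule from the text: lower-case package name, version, ".nupkg".
    public static string BuildFileName(string packageName, string version)
        => $"{packageName.ToLowerInvariant()}.{version}.nupkg";

    // Assumption: <package-name> in the path is used exactly as given.
    public static string BuildDownloadUrl(string packageName, string version)
        => $"{BaseUrl}/{packageName}/{version}/{BuildFileName(packageName, version)}";
}
```

For example, PackageUrls.BuildFileName("Explore.DataPumps", "2.0.3") yields explore.datapumps.2.0.3.nupkg. The access token still has to be sent in the PRIVATE-TOKEN header as in the curl commands.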
Check the following compatibility chart to select the correct package version:
Package | Version | GBS Compatibility |
---|---|---|
Explore.DataPumps | 2.0.3 | >= 2.0.x |
Explore.DataPumps | 3.0.1 | 2.1.x - 2.2.x |
Explore.DataPumps | 4.x | >= 2.3.x |
Explore.Common.SharedResources | 2.0.4 | >= 2.x |
Explore.Common.SharedResources | 3.0.0 | 2.1.x - 2.2.x |
Explore.Common.SharedResources | 4.x | >= 2.3.x |
Data Pump Interface
The following sections explain the structure of the Data Pump interface that every Data Pump is based on.
Parameters
public override string Parameters { get; }
Serialized list of the parameters needed to connect a Data Pump with its data source. It must be set in the constructor of the Data Pump:
// Requires: using System; using System.Collections.Generic; using System.Text.Json; using Explore.DataPumps.Entities;
// Must stay null for customer-written Data Pumps (see Module key).
private readonly string MODULE_KEY = null;
public override Uri ApiUrl { get; }
public override string Ident { get; } = Guid.Parse("00000000-0000-0000-0000-000000000000").ToString();
public override string Parameters { get; }

public DataPumpExample()
{
    this.ApiUrl = new Uri("https://localhost:4040");
    // Parameters needed to connect to the data source.
    var parameters = new List<DataPumpParameter>
    {
        new () { Id = "xpl_dp_param_parameter_1", Name = "Parameter 1", Value = "", ForceReInitialize = true, Description = "Description", DisplayInContext = true },
        new () { Id = "xpl_dp_param_parameter_2", Name = "Parameter 2", Value = "", IsMasked = true, Description = "Description", DisplayInContext = true },
        new () { Id = "xpl_dp_param_parameter_3", Name = "Parameter 3", Value = "", IsMultiline = true, Description = "Description" }
    };
    // The interface expects the list as a serialized JSON string.
    this.Parameters = JsonSerializer.Serialize(parameters);
}
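Once the parameters arrive in one of the interface methods, a Data Pump typically needs to look up individual values by their Id. The following is a minimal sketch of such a lookup. It uses a stand-in DataPumpParameter class that models only the properties visible in the constructor example, and the helper ParameterLookup.GetRequired is hypothetical, not part of Explore.DataPumps.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Stand-in for Explore.DataPumps.Entities.DataPumpParameter (assumption:
// only the properties used in the constructor example are modelled).
public class DataPumpParameter
{
    public string Id { get; set; } = "";
    public string Name { get; set; } = "";
    public string Value { get; set; } = "";
    public string Description { get; set; } = "";
    public bool IsMasked { get; set; }
    public bool IsMultiline { get; set; }
    public bool ForceReInitialize { get; set; }
    public bool DisplayInContext { get; set; }
}

public static class ParameterLookup
{
    // Hypothetical helper: returns the value of the parameter with the given
    // Id and throws if the parameter is missing or has no value.
    public static string GetRequired(IEnumerable<DataPumpParameter> parameters, string id)
    {
        var match = parameters.FirstOrDefault(p => p.Id == id);
        if (match == null || string.IsNullOrWhiteSpace(match.Value))
        {
            throw new ArgumentException($"Required parameter '{id}' is missing or empty.");
        }
        return match.Value;
    }
}
```

Failing early on a missing value keeps a long-running load from starting with an incomplete connection configuration.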
Version
Unique version number that must be increased according to Semantic Versioning whenever the Data Pump changes.
The version number can be kept in a text file named VERSION that is linked to the project:
1.0.0
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net8.0</TargetFramework>
    <Version>$(PackageVersion)</Version>
  </PropertyGroup>
  <ItemGroup>
    <!-- Copy version file to the package -->
    <None Include="../../VERSION" Pack="true" CopyToOutputDirectory="Always" PackagePath="/" />
  </ItemGroup>
</Project>
GetVersion() implementation
// Requires: using System.Reflection;
public override string GetVersion()
{
    // Reads the version that was compiled into this assembly.
    var version = Assembly.GetExecutingAssembly().GetName().Version;
    return version == null ? "undefined" : $"{version.Major}.{version.Minor}.{version.Build}";
}
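How the number in the VERSION file changes under Semantic Versioning can be sketched as follows. The ChangeKind enum and the SemVer.Bump helper are hypothetical names used only for illustration; they are not part of the Data Pump packages.

```csharp
using System;

// Kind of change made to the Data Pump (Semantic Versioning categories).
public enum ChangeKind { Major, Minor, Patch }

public static class SemVer
{
    // Returns the next version string for a "major.minor.patch" input.
    public static string Bump(string current, ChangeKind kind)
    {
        var v = Version.Parse(current);
        return kind switch
        {
            // Breaking change: increase major, reset minor and patch.
            ChangeKind.Major => $"{v.Major + 1}.0.0",
            // Backwards-compatible feature: increase minor, reset patch.
            ChangeKind.Minor => $"{v.Major}.{v.Minor + 1}.0",
            // Backwards-compatible fix: increase patch only.
            _ => $"{v.Major}.{v.Minor}.{v.Build + 1}",
        };
    }
}
```

Because every new version also needs a new Ident, a version bump and an Ident change always go together.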
Module key
Returns the license module key. It must be null for customer-written Data Pumps.
Ident
The Ident is a unique identifier represented as a GUID with the following syntax: 00000000-0000-0000-0000-000000000000.
Every Data Pump with a new version number must get a different Ident. You can’t upload two Data Pumps with the same Ident to GBS.
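A fresh Ident in the required syntax can be produced with the standard Guid type, for example once per release, and then pasted into the Ident property. The class name IdentGenerator is only illustrative.

```csharp
using System;

public static class IdentGenerator
{
    // Guid.ToString() uses the "D" format: 32 hexadecimal digits
    // in 8-4-4-4-12 groups separated by hyphens.
    public static string NewIdent() => Guid.NewGuid().ToString();
}
```

The value should be generated once and stored as a constant in the Data Pump; generating it at runtime would give the same build a different Ident on every start.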
GetEntityListAsync()
public abstract Task<List<string>> GetEntityListAsync(List<Explore.DataPumps.Entities.DataPumpParameter> parameters)
Returns the entity names that the loading process would provide for the given parameter set. It should be implemented as a shortcut that retrieves only the entity names. For a database, for example, this function would return the names of all tables found as entities.
- parameters: a List<DataPumpParameter> provided from the GBS UI that is needed to connect to the data source (see Parameters)
- entityNames: list of entity names retrieved by GetEntityListAsync
PreLoadAsync()
public abstract Task<List<Explore.DataPumps.Entities.DataPumpSourceMap>> PreLoadAsync(List<Explore.DataPumps.Entities.DataPumpParameter> parameters, List<string> entityNames)
This function is used to initialize containers. It is a shortcut that returns a small dataset, which is later mapped to the data entities and their columns, e.g. 15 rows from each database table.
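The idea of returning only a small sample can be sketched with a plain DataTable. This is an illustration under assumptions: the helper name PreLoadSample.TakeFirstRows is hypothetical, and how the sample is wrapped in a DataPumpSourceMap is left out because the layout of that class is not described here.

```csharp
using System;
using System.Data;

public static class PreLoadSample
{
    // Returns a new table with the same columns as the source and at most
    // maxRows rows, taken from the top of the source table.
    public static DataTable TakeFirstRows(DataTable source, int maxRows = 15)
    {
        var sample = source.Clone(); // copies the schema, not the rows
        var count = Math.Min(maxRows, source.Rows.Count);
        for (var i = 0; i < count; i++)
        {
            sample.ImportRow(source.Rows[i]);
        }
        return sample;
    }
}
```

With a real database source it is usually cheaper to limit the rows in the query itself rather than loading the full table and trimming it afterwards.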
LoadAsync()
public abstract Task<List<Explore.DataPumps.Entities.DataPumpSourceMap>> LoadAsync(List<Explore.DataPumps.Entities.DataPumpParameter> parameters, List<string> entityNames)
The general entry point to retrieve data from the data source. Within this function the loading of data and any additional data enrichment can be done.
GetArtifactsAsync()
public abstract Task<List<Explore.DataPumps.Entities.DataPumpArtifact>> GetArtifactsAsync(List<Explore.DataPumps.Entities.DataPumpParameter> parameters, List<string> entityNames, System.Data.DataTable currentRowsInRepository = null)
Returns all found file artifacts from the source. E.g. all attachments from Jira issues.
currentRowsInRepository contains the data currently stored for the target entity in GBS.
DataPumpArtifact
The DataPumpArtifact class represents an artifact and has the following structure:
Variable Name | Description |
---|---|
Entity | Found entity name from GetEntityListAsync (e.g. Jira Issue Key) |
ArtifactTypeID | Type of Artifact (see list below) |
DataMapSourceIndex | Index of the row in currentRowsInRepository that contains the entity name (e.g. Jira Issue Key); -1 if not found |
Value | Found artifact value (e.g. whole Url to file attachment) |
Title | Title of the artifact (e.g. file.png) |
The following list contains all artifact types supported by GBS. Data Context Hub Explorer offers different operations depending on the artifact type.
- Unknown = 1
- Video = 2
- Text = 3
- Image = 4
- GeometryFile = 5
- DataSet = 8
- Number = 9
- PDFDocument = 10
- Link = 11
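The DataMapSourceIndex rule and the type values above can be sketched as follows. The enum ArtifactType is a stand-in that only mirrors the listed numbers (the real type name in Explore.Common.SharedResources is not shown here), and ArtifactHelpers.FindSourceIndex with its keyColumn parameter is a hypothetical helper.

```csharp
using System;
using System.Data;

// Stand-in enum mirroring the numeric values listed above.
public enum ArtifactType
{
    Unknown = 1, Video = 2, Text = 3, Image = 4, GeometryFile = 5,
    DataSet = 8, Number = 9, PDFDocument = 10, Link = 11
}

public static class ArtifactHelpers
{
    // Returns the index of the first row whose keyColumn equals entityKey,
    // or -1 if the table, the column, or a matching row does not exist.
    public static int FindSourceIndex(DataTable currentRows, string keyColumn, string entityKey)
    {
        if (currentRows == null || !currentRows.Columns.Contains(keyColumn))
        {
            return -1;
        }
        for (var i = 0; i < currentRows.Rows.Count; i++)
        {
            var cell = Convert.ToString(currentRows.Rows[i][keyColumn]);
            if (string.Equals(cell, entityKey, StringComparison.Ordinal))
            {
                return i;
            }
        }
        return -1;
    }
}
```

The null check matters because currentRowsInRepository defaults to null in the GetArtifactsAsync signature.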
Logging
Several log levels are available in the DataPumpBase class. The logs are shown in the GBS Log.
protected void LogDebug<T>(string message, T source, [CallerMemberName] string callerMethod = "")
protected void LogInfo<T>(string message, T source, [CallerMemberName] string callerMethod = "")
protected void LogWarn<T>(string message, T source, [CallerMemberName] string callerMethod = "")
protected void LogError<T>(string message, Exception ex, T source, [CallerMemberName] string callerMethod = "")
protected void LogException<T>(string message, Exception ex, T source, [CallerMemberName] string callerMethod = "")
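The [CallerMemberName] attribute in these signatures means that callerMethod is filled in by the compiler and does not need to be passed. The sketch below shows that behaviour with a stand-in method of the same shape; LoggingDemo is a hypothetical class that returns the log line instead of writing to the GBS Log.

```csharp
using System;
using System.Runtime.CompilerServices;

public class LoggingDemo
{
    // Stand-in with the same parameter shape as DataPumpBase.LogInfo.
    protected string LogInfo<T>(string message, T source,
        [CallerMemberName] string callerMethod = "")
        => $"[INFO] {typeof(T).Name}.{callerMethod}: {message}";

    // callerMethod is not passed; the compiler inserts "StartLoad".
    public string StartLoad() => LogInfo("Loading started", this);
}
```

Calling new LoggingDemo().StartLoad() returns "[INFO] LoggingDemo.StartLoad: Loading started". In a real Data Pump the same call pattern would be LogInfo("Loading started", this) from inside the overriding method.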