5.1 Installation
Kubernetes is the target platform for running Data Context Hub. This guide assumes that a cluster exists and is accessible for deployments. Data Context Hub ships with all the components and services needed to run entirely in Kubernetes, without setting up external services. However, it can also use external services such as PostgreSQL and Neo4j, which is recommended in production environments. More information on configuring Data Context Hub to run with external services can be found below.
We recommend the adoption of ArgoCD as a continuous delivery solution for managing Kubernetes applications.
Transfer Docker Images
Docker images are available on our registry and can be pulled from there. However, we recommend that you transfer the images to your own registry as we do not guarantee high availability of our registry.
You can list all available image repositories with this command (jq is only used to pretty-print the JSON response):
curl --header "PRIVATE-TOKEN: <token>" "https://gitlab.c64.ai/api/v4/projects/4/registry/repositories" | jq
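Transferring an image to your own registry can be sketched with the standard docker pull, docker tag, and docker push commands. The image references below are hypothetical placeholders (take the real source paths from the repository listing and use your own registry host), and the script defaults to a dry run that only prints the commands.

```shell
# Sketch: copy one image from the vendor registry to your own registry.
# SRC_IMAGE and DST_IMAGE are hypothetical placeholders, not real paths.
SRC_IMAGE="${SRC_IMAGE:-source-registry.example.com/dch/example-service:1.0.0}"
DST_IMAGE="${DST_IMAGE:-my-registry.example.com/dch/example-service:1.0.0}"

# With DRY_RUN=1 (the default) commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Pull from the source, retag for the destination, and push.
mirror_image() {
  run docker pull "$1" &&
    run docker tag "$1" "$2" &&
    run docker push "$2"
}

mirror_image "$SRC_IMAGE" "$DST_IMAGE"
```

Once you are logged in to both registries with docker login, setting DRY_RUN=0 performs the actual transfer.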
Install Helm
We provide a Helm chart repository for deploying Data Context Hub on Kubernetes.
Its stable channel contains the stable release packages of Data Context Hub. Use the following command to add the Data Context Hub Helm repository:
helm repo add dch-stable https://gitlab.c64.ai/api/v4/projects/4/packages/helm/stable --username __token__ --password <token>
helm repo update
You can now deploy using helm install, but before doing so, make sure to read the next section on how to configure Data Context Hub.
helm install datacontexthub dch-stable/explore --namespace <namespace>
Configuring Data Context Hub
Before deploying a Data Context Hub instance, be sure to read the following sections for any required configuration.
It is highly recommended to create your own values.yaml file.
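One possible workflow for a custom values file is to print the chart defaults for reference, keep your overrides in your own file, and pass that file to Helm. The sketch below relies on helm show values and the --values flag, both standard Helm features. The namespace name dch and the file name are assumptions, and the commands are only printed unless DRY_RUN=0 is set.

```shell
NAMESPACE="${NAMESPACE:-dch}"              # hypothetical namespace name
VALUES_FILE="${VALUES_FILE:-values.yaml}"  # your own overrides file

# With DRY_RUN=1 (the default) commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Print the chart's default values to stdout for reference.
show_defaults() {
  run helm show values dch-stable/explore
}

# Install the release with your own values file applied on top of the defaults.
install_dch() {
  run helm install datacontexthub dch-stable/explore \
    --namespace "$NAMESPACE" --values "$VALUES_FILE"
}

show_defaults
install_dch
```

With DRY_RUN=0, the output of show_defaults can be redirected to a file and used as a starting point for your own values.yaml.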
Development Mode
The Development Mode is intended for use in non-production environments, such as testing or development setups. When activated, certificate validation is disabled.
If certificate validation is deactivated, this is done at the user's own risk. In this case, Context64 GmbH assumes no warranty or liability for risks that may arise from insecure or expired certificates, to the extent permitted by law.
To enable Development Mode, set the "mode" value in the values.yaml to "development".
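As a minimal sketch, the setting can be kept in a small override file. This assumes that mode is a top-level key in values.yaml, as the wording above suggests; the file name values-dev.yaml is hypothetical.

```shell
# Write a minimal override file that switches on Development Mode.
cat > values-dev.yaml <<'EOF'
# Non-production only: certificate validation is disabled in this mode.
mode: development
EOF

# Show the resulting file.
cat values-dev.yaml
```

The file can then be passed to helm install with an additional --values values-dev.yaml argument.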
Versions of 3rd Party Services
The values.yaml includes the versions of third-party services that Data Context Hub was tested with. While we do not expect any problems with different minor or patch versions, it is highly recommended to use the provided versions.
Storage
Data Context Hub requires a set of volumes for persistence and data exchange between services in the system. The following services use volumes:
- Airflow
- Postgres
- Neo4j
- Weaviate
- OpenSearch
By default, storage is configured to create Persistent Volume Claims (PVCs) automatically. You have to configure the appropriate storage class and volume size for each service by setting <service>.persistence.storageClassName and <service>.persistence.size in values.yaml. Because OpenSearch quickly takes up a lot of space, its volumes are optional and deactivated by default. You can enable persistence for a service by setting <service>.persistence.enabled: true.
Optionally, you can use local persistent volumes, which are written to the file system. To do so, set storage.create_local_storage_pv: true and <service>.persistence.storageClassName: local-storage, and adapt the pv_path_* variables. Remember to manually create all folders used in the pv_path_* variables with sufficient permissions before starting an instance.
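The two storage variants described above might look as follows in override files. This is a sketch under assumptions: the storage class name standard and the sizes are hypothetical, the service key opensearch is assumed (only postgres is confirmed elsewhere in this guide), and the pv_path_* variables are left out because their exact names must be taken from the chart's values.yaml.

```shell
# Variant 1: dynamically provisioned PVCs.
cat > values-storage.yaml <<'EOF'
postgres:
  persistence:
    storageClassName: standard   # hypothetical storage class
    size: 20Gi                   # hypothetical size
opensearch:                      # assumed service key
  persistence:
    enabled: true                # off by default for OpenSearch
    storageClassName: standard
    size: 50Gi
EOF

# Variant 2: local persistent volumes on the file system.
# The pv_path_* variables must also be adapted (see the chart's values.yaml),
# and the folders they point to must exist before the instance starts.
cat > values-local-storage.yaml <<'EOF'
storage:
  create_local_storage_pv: true
postgres:
  persistence:
    storageClassName: local-storage
EOF
```

Only one of the two files would normally be passed to Helm, depending on which storage variant your cluster uses.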
SMTP Configuration in Airflow
Airflow supports sending emails when tasks are retried or fail. To have Airflow send emails, the following variables have to be set in your values.yaml:
- airflow__smtp__smtp_host
- airflow__smtp__smtp_mail_from
- airflow__smtp__smtp_port
Additionally, the username and password of the SMTP user have to be configured in Airflow's UI:
- Create a new connection called "smtp_default".
- Fill in Login (= username) and Password, then save. It doesn't matter which Connection Type is chosen because only the credentials are used.
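The three SMTP variables could be collected in an override file like the one below. The host and sender address are hypothetical, 587 is the common mail submission port, and the placement of the keys at the top level is an assumption; check the chart's values.yaml for where they actually belong.

```shell
# Sketch of an SMTP override file; key placement is an assumption.
cat > values-smtp.yaml <<'EOF'
airflow__smtp__smtp_host: smtp.example.com          # hypothetical host
airflow__smtp__smtp_mail_from: airflow@example.com  # hypothetical sender
airflow__smtp__smtp_port: 587                       # common submission port
EOF

cat values-smtp.yaml
```

The credentials are deliberately absent from this file because they are stored in the "smtp_default" connection in Airflow's UI.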
Using External Services
Please check values.yaml to see which versions of the following services Data Context Hub was tested with.
PostgreSQL
The system can optionally work with a previously provisioned PostgreSQL instance (e.g. on a bare-metal server or AWS Aurora). In this case, postgres.enabled has to be set to false and the following configuration variables have to be adapted:
- postgres.host
- postgres.port
- postgres.user
- postgres.password
Depending on the database user provided to Data Context Hub, several of the required databases are created automatically by the system (see the Notes column below). Otherwise, all databases have to be present before Data Context Hub is started for the first time. The following table lists all databases required by Data Context Hub together with the corresponding configuration variables in values.yaml.
| Database | Configuration | Notes |
|---|---|---|
| airflow | airflow.database | |
| keycloak | keycloak.database | |
| dch_cont | environment.cont_database | Created automatically |
| dch_repo | environment.repo_database | Created automatically |
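Putting the PostgreSQL variables and the database names from the table together, an override file for an external instance might look like this. Host, user, and password are hypothetical placeholders; 5432 is the PostgreSQL default port.

```shell
# Sketch of an override file for an external PostgreSQL instance.
cat > values-external-postgres.yaml <<'EOF'
postgres:
  enabled: false              # do not deploy the bundled PostgreSQL
  host: db.example.internal   # hypothetical host
  port: 5432                  # PostgreSQL default port
  user: dch                   # hypothetical user
  password: change-me         # placeholder
airflow:
  database: airflow
keycloak:
  database: keycloak
environment:
  cont_database: dch_cont
  repo_database: dch_repo
EOF

cat values-external-postgres.yaml
```

If the configured user cannot create databases, the databases named in this file have to exist before the first start.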
Neo4j
Neo4j is not included in the system and must be installed separately.
To integrate Neo4j, you need to set up and manage your own Neo4j or Neo4j AuraDB instance.
A sample configuration for this integration is available in the Helm chart.