Export to Datahub or Amundsen

Overview

Metadata stored in the Tokern Catalog, especially PII and column-level lineage, can be exported to Datahub or Amundsen.

Datahub

dbcat provides a Source plugin. The source plugin has to be configured in an ingestion recipe.

CatalogSource accepts the following configuration:

  • path: Path to SQLite database
  • user: user name of role in Postgres Catalog
  • password: password of role in Postgres Catalog
  • host: host name of Postgres Catalog
  • db: database name of the Postgres Catalog
  • port: Port number of Postgres Catalog
  • secret: Secret Key to encrypt passwords and tokens in the Catalog
  • source_names: List of sources to export
  • include_schema_regex: List of regular expressions that specify which schemata to include. Refer to include_exclude_lists
  • exclude_schema_regex: List of regular expressions that specify which schemata to exclude. Refer to include_exclude_lists
  • include_table_regex: List of regular expressions that specify which tables to include. Refer to include_exclude_lists
  • exclude_table_regex: List of regular expressions that specify which tables to exclude. Refer to include_exclude_lists
  • include_source_name: True/False. Specifies whether table names should include the source, in the format source.schema.table. Useful when there are multiple databases
  • env: Environment expected by Datahub. Default is PROD
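
The include/exclude lists and the include_source_name flag can be pictured with a small sketch. The helpers below, `is_included` and `qualified_name`, are hypothetical and are not part of dbcat. The sketch assumes that a pattern matches anywhere in a name, that an exclude match takes precedence over an include match, and that an empty include list admits everything. dbcat's actual rules may differ, so check include_exclude_lists before relying on this behaviour.

```python
import re
from typing import List, Optional, Sequence


def is_included(name: str,
                include: Optional[Sequence[str]] = None,
                exclude: Optional[Sequence[str]] = None) -> bool:
    """Hypothetical filter, not dbcat's implementation."""
    # Assumption: an exclude match always wins over an include match.
    if exclude and any(re.search(pattern, name) for pattern in exclude):
        return False
    # Assumption: a missing or empty include list admits every name.
    if not include:
        return True
    return any(re.search(pattern, name) for pattern in include)


def qualified_name(source: str, schema: str, table: str,
                   include_source_name: bool) -> str:
    """Build schema.table, or source.schema.table when the flag is set."""
    parts: List[str] = [schema, table]
    if include_source_name:
        parts.insert(0, source)
    return ".".join(parts)


schemata = ["events", "events_staging", "finance", "tmp_scratch"]
kept = [s for s in schemata
        if is_included(s, include=[r"^events"], exclude=[r"_staging$"])]
print(kept)

# page_views is a made-up table name used only for illustration.
with_source = qualified_name("redshift_prod", "events", "page_views", True)
without_source = qualified_name("redshift_prod", "events", "page_views", False)
print(with_source)
print(without_source)
```

With these assumptions only the events schema survives the filter, and the same table is named redshift_prod.events.page_views or events.page_views depending on the flag.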

Installation

# Install required libraries in a virtualenv
pip install "dbcat[datahub]"

# Create an ingestion recipe (see below)

# Run recipe
datahub ingest -c contrib/datahub/export.yml

Example Recipes

Basic Recipe

The following recipe sets up CatalogSource with its default configuration and a console sink:

source:
  type: dbcat.datahub.CatalogSource
sink:
  type: "console"

Postgres Catalog, specific source and include schema

source:
  type: dbcat.datahub.CatalogSource
  config:
    user: tokern
    password: passw0rd
    host: postgres
    database: tdb
    secret: my_secret_password
    source_names:
      - redshift_prod
      - bq_analysis
    include_schema:
      - events
sink:
  type: "console"

To configure sinks, refer to the Datahub metadata ingestion documentation.
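
A recipe is a nested mapping, so the Postgres example can also be assembled in Python when values such as source names come from elsewhere. The sketch below is only an illustration. `build_recipe` is a hypothetical helper, and the keys mirror the example recipe above (including `database` and `include_schema`) rather than a verified dbcat schema. Datahub documents a way to run recipes programmatically through its `Pipeline` class; that step is left out so the sketch runs without Datahub installed.

```python
import json
from typing import Any, Dict, List


def build_recipe(source_names: List[str],
                 include_schema: List[str]) -> Dict[str, Any]:
    """Hypothetical helper that mirrors the Postgres example recipe."""
    return {
        "source": {
            "type": "dbcat.datahub.CatalogSource",
            "config": {
                # Connection values copied from the example recipe.
                "user": "tokern",
                "password": "passw0rd",
                "host": "postgres",
                "database": "tdb",
                "secret": "my_secret_password",
                "source_names": list(source_names),
                "include_schema": list(include_schema),
            },
        },
        # Console sink, as in the example recipes.
        "sink": {"type": "console"},
    }


recipe = build_recipe(["redshift_prod", "bq_analysis"], ["events"])
print(json.dumps(recipe, indent=2))
```

Keeping the recipe as a plain dictionary makes it easy to inspect or serialize before handing it to an ingestion run.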

Amundsen

dbcat provides a CatalogExtractor to extract metadata. The extractor can be used in an Amundsen metadata ingestion pipeline.

CatalogExtractor accepts the following configuration:

  • catalog_config: Dictionary of connection parameters, as described in catalog configuration
  • source_names: List of sources to export
  • include_schema_regex: List of regular expressions that specify which schemata to include. Refer to include_exclude_lists
  • exclude_schema_regex: List of regular expressions that specify which schemata to exclude. Refer to include_exclude_lists
  • include_table_regex: List of regular expressions that specify which tables to include. Refer to include_exclude_lists
  • exclude_table_regex: List of regular expressions that specify which tables to exclude. Refer to include_exclude_lists

Check out an example loader in the GitHub project.