FAQs on Data Catalog

Rajat Venkatesh — 1/30/2020 — 2 Min Read

Why do you need a data catalog?

A data catalog helps solve two problems in modern data teams:
Poor productivity and a low return on investment (ROI) from data
Governance risk

Productivity

Analysts often cannot discover what data exists or how much their peers have already processed it. Without a data catalog, valuable data goes unused and effort is duplicated, especially in a schema-on-use world where raw data requires preparation before it can be analyzed. Organizations often rely on “tribal knowledge” to manage this productivity problem, but that is not a systematic solution and it scales poorly as organizations grow.

Governance Risk

Governance risk is becoming an issue as regulators clamp down on the use and transfer of data. For instance, in the European Union, if your data contains personal information and your organization doesn’t comply with the General Data Protection Regulation (GDPR), you could be liable for a fine of up to €20 million or 4% of annual worldwide turnover, whichever is higher.

Data management entails tracking or controlling who accesses data, what they do with it, where they put it, and how it gets consumed downstream. Without a standard place to store metadata and answer these questions, enforcing policies and auditing behavior becomes challenging.

Is there an open-source data catalog?

Yes. Open-source data catalog projects include:

Amundsen by Lyft
Metacat by Netflix
Apache Atlas by the Apache community, a platform for data governance and metadata management
PIICatcher by Tokern, a simple-to-use open-source data catalog for databases and filesystems

How do you build a data catalog?

Choose one of the open-source data catalog projects, such as Amundsen, [Apache Atlas](https://atlas.apache.org/), or PIICatcher.
Follow the installation instructions of the project. Some projects require a Hadoop cluster, while PIICatcher is a Python package.
Integrate your databases and data engines with the data catalog.
Let the data catalog scan and organize the metadata.
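To make the scanning step concrete, here is a minimal sketch of what "scanning metadata" can look like. It is not the implementation of any of the projects above; it only uses Python's standard `sqlite3` module, and the function names `scan_metadata` and `build_demo_catalog` as well as the demo tables are hypothetical, chosen for illustration.

```python
import sqlite3


def scan_metadata(connection):
    """Return {table_name: [(column_name, declared_type), ...]} for a SQLite database."""
    catalog = {}
    # sqlite_master lists every object in the database; skip SQLite's internal tables.
    tables = connection.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    for (table_name,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = connection.execute(
            f'PRAGMA table_info("{table_name}")'
        ).fetchall()
        catalog[table_name] = [(col[1], col[2]) for col in columns]
    return catalog


def build_demo_catalog():
    """Create a throwaway in-memory database (hypothetical schema) and scan it."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute(
            "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, signup_date TEXT)"
        )
        connection.execute(
            "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
        )
        return scan_metadata(connection)
    finally:
        connection.close()


if __name__ == "__main__":
    for table, columns in build_demo_catalog().items():
        print(table, columns)
```

A real catalog does the same thing at a larger scale: it connects to each registered source, reads table and column metadata, and stores the result in one searchable place.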

What are data catalog use cases?

A data catalog enables many applications that improve productivity and governance. A representative list:

Data Discovery
Data Dictionary
Data Provenance
ROI Measurement
Privileged Access Management
Auditing and Compliance

Is there a data catalog in Amazon Web Services (AWS)?

Yes. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. Along with the ETL service, it provides the AWS Glue Data Catalog, a metadata store that is compatible with Apache Hive Metastore.
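As a rough sketch of how the Glue catalog can be read programmatically, the function below lists table and column metadata for one database. It assumes a boto3 Glue client is passed in (boto3 must be installed and AWS credentials configured separately); the function name `list_glue_columns` and the database name `sales` are hypothetical.

```python
def list_glue_columns(glue_client, database_name):
    """Return (table, column, type) tuples for every table in a Glue database.

    glue_client is expected to behave like boto3.client("glue").
    """
    rows = []
    # get_tables is paginated, so walk every page of results.
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page["TableList"]:
            # Some tables carry no StorageDescriptor or no columns.
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            for column in columns:
                rows.append((table["Name"], column["Name"], column.get("Type", "")))
    return rows


# Example usage (requires boto3 and AWS credentials; "sales" is a made-up name):
#   import boto3
#   for row in list_glue_columns(boto3.client("glue"), "sales"):
#       print(row)
```

Passing the client in, rather than creating it inside the function, keeps the sketch easy to try against a stub before pointing it at a real account.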

Is there a data catalog in Google Cloud Platform (GCP)?

Google Data Catalog is a fully managed and scalable metadata management service that empowers organizations to quickly discover, manage, and understand all their data in Google Cloud.

Is there a data catalog in Azure?

Azure Data Catalog lets users discover the data sources they need and understand the data sources they find. At the same time, Data Catalog helps organizations get more value from their existing investments.
