FAQs on Data Catalog

Rajat Venkatesh | 1/30/2020 | 2 Min Read

Why do you need a data catalog?

A data catalog helps solve two problems in modern data teams:

  • Poor productivity and low ROI on data
  • Governance risk


Productivity

Analysts often cannot discover what data exists or how much processing their peers have already applied to it. In a schema-on-use world where raw data requires preparation, the absence of a data catalog leaves valuable data unused and leads to duplicated effort. Organizations usually manage this productivity problem through “tribal knowledge”, but that is not a systematic solution and scales very poorly as organizations grow.


Governance Risk

Governance risk is a growing concern as regulators clamp down on the use and transfer of data. For instance, in the European Union, if your data contains personal information and your organization doesn’t comply with the General Data Protection Regulation (GDPR), you could be liable for a fine of up to €20 million or 4% of annual worldwide turnover, whichever is higher.


Data management entails tracking or controlling who accesses data, what they do with it, where they put it, and how it gets consumed downstream. Without a standard place to store metadata and answer these questions, enforcing policies and auditing behavior becomes challenging.
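As a rough illustration of how stored metadata can answer these questions, here is a minimal sketch of an access log kept alongside a catalog. Everything in it is hypothetical: the `AccessEvent` record, the helper functions, the user names, and the `s3://` destination are invented for the example and do not come from any particular catalog product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AccessEvent:
    """One hypothetical audit record: who did what to which dataset."""
    user: str
    dataset: str
    action: str       # e.g. "read" or "export"
    destination: str  # where the data was copied to, "" if it stayed in place
    at: datetime


def who_accessed(events, dataset):
    """Return the sorted, de-duplicated users who touched a dataset."""
    return sorted({e.user for e in events if e.dataset == dataset})


def exports_of(events, dataset):
    """Return (user, destination) pairs for data copied out of a dataset."""
    return [
        (e.user, e.destination)
        for e in events
        if e.dataset == dataset and e.action == "export"
    ]


# Illustrative sample data only.
now = datetime(2020, 1, 30, tzinfo=timezone.utc)
events = [
    AccessEvent("ana", "customers", "read", "", now),
    AccessEvent("raj", "customers", "export", "s3://reports/customers.csv", now),
    AccessEvent("ana", "orders", "read", "", now),
]

print(who_accessed(events, "customers"))
print(exports_of(events, "customers"))
```

A real deployment would likely populate such records from database or query-engine audit logs rather than by hand; the point is only that a single, queryable store makes policy checks and audits straightforward.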


Is there an open-source data catalog?

Yes. Open-source data catalog projects include Amundsen, Apache Atlas, and PIICatcher.


How do you build a data catalog?

  • Choose one of the open-source data catalog projects, such as Amundsen, Apache Atlas (https://atlas.apache.org/), or PIICatcher.
  • Follow the installation instructions of the project. Some projects require a Hadoop cluster, while PIICatcher is a Python package.
  • Integrate databases and data engines with the data catalog.
  • The data catalog will then scan and organize the metadata.
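The scan-and-organize step can be pictured with a minimal sketch. It assumes a SQLite database as the source (chosen only because it ships with Python) and a plain dictionary as the catalog. The `scan_sqlite` function and the sample `orders` table are hypothetical and are not part of Amundsen, Apache Atlas, or PIICatcher.

```python
import sqlite3


def scan_sqlite(conn):
    """Collect table and column metadata from a SQLite connection.

    Returns a dict mapping table name -> list of {"name", "type"} dicts.
    """
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog[table] = [{"name": c[1], "type": c[2]} for c in columns]
    return catalog


# Hypothetical sample database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, email TEXT, amount REAL)")

catalog = scan_sqlite(conn)
print(catalog)
```

Production catalogs do the same thing at larger scale: they connect to each registered source, read its system tables or metastore, and persist the results in a searchable store.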


What are data catalog use cases?

A data catalog enables many applications that improve productivity and governance. A representative list of applications follows:

  • Data Discovery
  • Data Dictionary
  • Data Provenance
  • Measure ROI
  • Privileged Access Management
  • Auditing and compliance
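To make the first two applications concrete, here is a minimal sketch of data discovery over a data dictionary. It assumes the catalog is a small in-memory dictionary of table and column descriptions; the `CATALOG` contents and the `discover` function are hypothetical and only illustrate the idea of keyword search over metadata.

```python
# Hypothetical data dictionary: table -> description and column docs.
CATALOG = {
    "orders": {
        "description": "One row per customer order",
        "columns": {
            "id": "Order identifier",
            "email": "Customer email address",
        },
    },
    "page_views": {
        "description": "Raw web traffic events",
        "columns": {
            "url": "Page visited",
            "ts": "Event timestamp",
        },
    },
}


def discover(catalog, keyword):
    """Return sorted 'table' or 'table.column' names whose text mentions keyword."""
    keyword = keyword.lower()
    hits = []
    for table, meta in catalog.items():
        if keyword in table.lower() or keyword in meta["description"].lower():
            hits.append(table)
        for column, doc in meta["columns"].items():
            if keyword in column.lower() or keyword in doc.lower():
                hits.append(f"{table}.{column}")
    return sorted(hits)


print(discover(CATALOG, "email"))
print(discover(CATALOG, "order"))
```

An analyst searching for "email" is pointed at the one column that holds it, instead of asking around for tribal knowledge.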


Is there a data catalog in Amazon Web Services (AWS)?

Yes. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. Along with the ETL service, it provides the AWS Glue Data Catalog, a metadata store that is compatible with the Apache Hive Metastore.


Is there a data catalog in Google Cloud Platform (GCP)?

Google Data Catalog is a fully managed and scalable metadata management service that empowers organizations to quickly discover, manage, and understand all their data in Google Cloud.


Is there a data catalog in Azure?

Azure Data Catalog lets users discover the data sources they need and understand the data sources they find. At the same time, Data Catalog helps organizations get more value from their existing investments.

