FAQs on Data Catalog
A data catalog addresses two problems in modern data teams:
- Poor productivity of people and low ROI of data
- Governance risk
Analysts are often unable to discover what data exists, much less how it has been previously used by peers. Valuable data is left unused and human effort is routinely duplicated—particularly in a schema-on-use world with raw data that requires preparation. “Tribal knowledge” is a common description for how organizations manage this productivity problem. This is clearly not a systematic solution, and scales very poorly as organizations grow.
Governance risk is becoming an increasingly serious issue as regulators clamp down on the use and transfer of data.
Data management necessarily entails tracking or controlling who accesses data, what they do with it, where they put it, and how it gets consumed downstream. In the absence of a standard place to store metadata and answer these questions, it is impossible to enforce policies and/or audit behaviour.
Under the GDPR, for example, if you cannot do that and your data contains information about people, you could be liable for fines of up to €20 million or 4% of annual worldwide turnover, whichever is higher.
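The "whichever is higher" rule is simple arithmetic, and a minimal sketch makes the exposure concrete. The function name `max_fine` and its parameters are hypothetical; the fixed amount defaults to the 20 million figure quoted above, and the result is only the upper bound described here, not what a regulator would actually impose.

```python
def max_fine(annual_revenue, fixed_amount=20_000_000, percent=4):
    """Upper bound on the fine: the fixed amount or a percentage of revenue,
    whichever is higher. All names and defaults here are illustrative."""
    return max(fixed_amount, annual_revenue * percent / 100)


# A company with 1 billion in annual revenue: 4% exceeds the fixed amount.
large = max_fine(1_000_000_000)

# A company with 100 million in annual revenue: the fixed amount dominates.
small = max_fine(100_000_000)

print(large, small)
```

For smaller organizations the fixed amount is the binding figure, so the exposure does not shrink with revenue.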
References: Ground Research Paper
Open source data catalog projects include:
- Amundsen by Lyft
- Metacat by Netflix
- Apache Atlas is a data governance and metadata management platform by the Apache community.
- PIICatcher Data Catalog by Tokern. PIICatcher is a simple but effective open source data catalog for databases and filesystems.
- Choose one of the open source data catalog projects, such as Amundsen, [Apache Atlas](https://atlas.apache.org/) or PIICatcher.
- Follow the installation instructions of the project. Some require a Hadoop cluster, while others, like PIICatcher, are simple Python packages.
- Integrate databases and data engines with the data catalog. The data catalog will scan and organize the metadata.
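To illustrate what "scan and organize the metadata" can mean in practice, here is a minimal sketch that reads table and column metadata from a single SQLite database using only the Python standard library. It is not how Amundsen, Apache Atlas or PIICatcher are implemented; the function name `scan_metadata` and the example tables are assumptions made for illustration.

```python
import sqlite3


def scan_metadata(conn):
    """Return {table_name: [(column_name, declared_type), ...]} for one SQLite database.

    Hypothetical helper for illustration; real catalogs collect far more
    (owners, lineage, usage statistics, tags).
    """
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog[table] = [(col[1], col[2]) for col in columns]
    return catalog


# Example database with two made-up tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

catalog = scan_metadata(conn)
for table, columns in catalog.items():
    print(table, columns)
```

A real integration would run a scan like this on a schedule against each connected database and store the results in the catalog's own metadata store, so that search and governance features work from a single copy.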
A Data Catalog enables many applications to improve productivity and governance. A representative list of applications is:
- Data Discovery
- Data Dictionary
- Data Provenance
- Measure ROI
- Privileged Access Management
- Auditing and compliance
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. Along with the ETL service, it provides a data catalog that is compatible with the Apache Hive Metastore.
Google Data Catalog is a fully managed and scalable metadata management service that empowers organizations to quickly discover, manage, and understand all their data in Google Cloud.
Azure Data Catalog lets users discover the data sources they need and understand the data sources they find. At the same time, Data Catalog helps organizations get more value from their existing investments.