Analysts often cannot discover what data exists or how much their peers have already processed it. In a schema-on-use world, where raw data requires preparation before it can be used, the absence of a data catalog means valuable data goes unused or human effort is duplicated. Organizations often rely on “tribal knowledge” to manage this productivity problem, but that is not a systematic solution and scales poorly as organizations grow.
Governance risk is becoming an issue as regulators clamp down on the use and transfer of data. For instance, in the European Union, if your data contains personal information and your organization doesn’t comply with the General Data Protection Regulation (GDPR), you could be liable to a fine of up to €20 million or 4% of annual worldwide turnover, whichever is higher.
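To make the size of that exposure concrete, here is a minimal sketch of the upper-tier cap in GDPR Article 83(5): the greater of €20 million or 4% of total worldwide annual turnover. The function name and the example turnover figures are hypothetical, and the fine actually imposed in any case is decided by the supervisory authority, so treat this as an illustration of the ceiling only.

```python
def gdpr_upper_tier_cap(annual_turnover_eur: int) -> float:
    """Return the maximum upper-tier GDPR fine in euros (Art. 83(5)).

    The cap is whichever is higher: a flat EUR 20 million or
    4% of total worldwide annual turnover.
    """
    flat_cap_eur = 20_000_000
    turnover_cap_eur = annual_turnover_eur * 4 / 100
    return max(flat_cap_eur, turnover_cap_eur)


# Hypothetical company with EUR 100 million turnover: the flat cap is higher.
small = gdpr_upper_tier_cap(100_000_000)

# Hypothetical company with EUR 2 billion turnover: the 4% cap is higher.
large = gdpr_upper_tier_cap(2_000_000_000)

print(small, large)
```

For smaller organizations the flat cap dominates, while for large ones the exposure grows with turnover.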
Data management entails tracking or controlling who accesses data, what they do with it, where they put it, and how it gets consumed downstream. Without a standard place to store metadata and answer these questions, enforcing policies and auditing behavior becomes challenging.
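As a rough illustration of what "a standard place to store metadata" can look like, the sketch below keeps dataset metadata and access events together in one in-memory structure so that the who, what, and where questions can be answered from a single source. Every name here (`Catalog`, `AccessEvent`, `who_accessed`, `pii_exports`, the dataset names, and the example destinations) is hypothetical and not taken from any particular catalog product; a real catalog would persist this data and integrate with the systems that serve it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AccessEvent:
    user: str            # who accessed the data
    dataset: str         # which dataset was touched
    action: str          # what they did, e.g. "read" or "export"
    destination: str     # where the data was put ("" if it stayed in place)
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class Catalog:
    # dataset name -> metadata such as owner and whether it holds PII
    datasets: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    def register(self, name, owner, contains_pii=False):
        self.datasets[name] = {"owner": owner, "contains_pii": contains_pii}

    def record(self, user, dataset, action, destination=""):
        # Refuse to log events for datasets the catalog does not know about.
        if dataset not in self.datasets:
            raise KeyError(f"unknown dataset: {dataset}")
        self.events.append(AccessEvent(user, dataset, action, destination))

    def who_accessed(self, dataset):
        return sorted({e.user for e in self.events if e.dataset == dataset})

    def pii_exports(self):
        # Audit query: exports of any dataset flagged as containing PII.
        return [e for e in self.events
                if e.action == "export"
                and self.datasets[e.dataset]["contains_pii"]]


catalog = Catalog()
catalog.register("customers", owner="crm-team", contains_pii=True)
catalog.register("page_views", owner="web-team")
catalog.record("alice", "customers", "read")
catalog.record("bob", "customers", "export", destination="s3://example-bucket/tmp/")
catalog.record("alice", "page_views", "export", destination="laptop")

print(catalog.who_accessed("customers"))
print([(e.user, e.destination) for e in catalog.pii_exports()])
```

Because the metadata (the PII flag) and the access log live in the same place, a policy question such as "who exported personal data, and to where?" becomes a simple query instead of an investigation.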
Open source data catalog projects include:
A Data Catalog enables many applications to improve productivity and governance. Shown below is a representative list of applications:
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. Along with the ETL service, it also provides a data catalog that is compatible with the Apache Hive Metastore.
Google Data Catalog is a fully managed and scalable metadata management service that empowers organizations to quickly discover, manage, and understand all their data in Google Cloud.
Azure Data Catalog lets users discover the data sources they need and understand the data sources they find. At the same time, Data Catalog helps organizations get more value from their existing investments.