What are the differences between data catalogs, dictionaries, taxonomies and glossaries?

Rajat Venkatesh | 10/9/2020 | 3 Min Read

Metadata in a data lake is crucial for productivity within a data ecosystem. The links between different types of metadata, the systems that store them, and their consumers can be very confusing. How is a data catalog different from a dictionary or a glossary? This post explains what each of these metadata systems stores and who uses it.

Information Schema

Information schemas store basic metadata within a database. The information schema is defined by the ANSI SQL standard and provides system information on tables, views, columns, users, permissions, and other database-specific information. Database administrators use the information schema to monitor the internal state of the database.

It is typically accessed through SQL statements or non-standard commands like SHOW or DESCRIBE at the database prompt or in scripts.

An example in the MySQL documentation lists all tables in a schema named “db5”, along with system or database-specific information such as the storage engine. In data lakes, the Hive Metastore and AWS Glue Data Catalog are popular systems that play the role of an information schema.

There are multiple instances of an information schema - one per database in the organization.
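
To make the idea of a database describing itself concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is an analogy rather than the standard: SQLite does not implement the INFORMATION_SCHEMA views, and instead exposes comparable metadata through its sqlite_master table and the non-standard PRAGMA table_info command. The orders and customers tables and the helper names list_tables and describe_table are hypothetical.

```python
import sqlite3


def list_tables(conn):
    """Return the names of user tables, read from SQLite's own catalog table."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def describe_table(conn, table):
    """Return (column_name, declared_type) pairs for one table."""
    # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [(row[1], row[2]) for row in rows]


# Hypothetical tables, created in memory so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, purchase_date TEXT)"
)
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

for table in list_tables(conn):
    print(table, describe_table(conn, table))
```

On a database that does implement the standard, the same questions would typically be asked with a SELECT against information_schema.tables and information_schema.columns.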

Data Catalog

The Data Catalog is a system-wide inventory of all the data assets. A useful analogy is a library catalog, which stores information on book availability, edition, authors, description, and other metadata. Just as a library catalog is used to discover books, a data catalog helps users discover and explore data assets. Different personas rely on a data catalog. Data engineers want to know the impact of a new feature in ETL pipelines. Data scientists and analysts use data catalogs to find the right data sets for their work. Data stewards scan data catalogs to ensure compliance with security and governance policies.

A primary source for the data catalog is the information schemas of all the databases, data warehouses, and data lakes. The catalog also contains other technical information such as lineage, ETL scripts, ACLs, and access history.

A data catalog is typically available through a web UI and has APIs for scripting. Popular open-source data catalogs are DataHub and Amundsen.
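
The description above, an inventory fed by many information schemas and enriched with lineage, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how DataHub or Amundsen are implemented. The class names TableEntry and DataCatalog, the source names, and the tables are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TableEntry:
    source: str        # database, warehouse, or lake the table lives in
    name: str
    columns: list
    upstream: list = field(default_factory=list)  # lineage: tables this one reads from


class DataCatalog:
    """Toy system-wide inventory keyed by '<source>.<table>'."""

    def __init__(self):
        self._entries = {}

    def ingest(self, source, tables):
        """Register tables harvested from one source's information schema."""
        for name, columns in tables.items():
            self._entries[f"{source}.{name}"] = TableEntry(source, name, list(columns))

    def add_lineage(self, target, upstream):
        """Record that `target` is built from `upstream`."""
        self._entries[target].upstream.append(upstream)

    def search(self, keyword):
        """Return tables whose name or column names mention the keyword."""
        keyword = keyword.lower()
        return sorted(
            key
            for key, entry in self._entries.items()
            if keyword in entry.name.lower()
            or any(keyword in column.lower() for column in entry.columns)
        )

    def downstream_of(self, qualified_name):
        """Return tables that read directly from the given table."""
        return sorted(
            key for key, entry in self._entries.items() if qualified_name in entry.upstream
        )


catalog = DataCatalog()
catalog.ingest("mysql_shop", {
    "orders": ["id", "customer_id", "purchase_date"],
    "customers": ["id", "name"],
})
catalog.ingest("warehouse", {"daily_revenue": ["day", "revenue"]})
catalog.add_lineage("warehouse.daily_revenue", "mysql_shop.orders")

print(catalog.search("customer"))                  # discovery, as an analyst would use it
print(catalog.downstream_of("mysql_shop.orders"))  # impact analysis, as an engineer would
```

The search method corresponds to the discovery use case of analysts and data scientists, while downstream_of corresponds to the impact question a data engineer asks before changing a pipeline.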

Business Glossaries

Business glossaries define business terms. A simple example is the definition of a customer or a lead. Without a business glossary, people can hold different opinions on the meaning of even simple terms, such as customer or purchase date.

Business glossaries add semantic meaning to data. While a data catalog may state that a column contains a date, a glossary provides information on the interpretation of that date. Is the date defined as the order date, delivery date, or payment date?
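
As a small illustration of how a glossary layers meaning on top of technical metadata, the sketch below links column names to glossary terms. The definitions, the column names, and the explain helper are hypothetical, chosen only to mirror the date example above. A real glossary would be agreed on by the business.

```python
# Hypothetical glossary: business term -> agreed definition.
GLOSSARY = {
    "purchase date": "The date on which the customer placed the order.",
    "customer": "A person or organization with at least one completed order.",
}

# Hypothetical mapping from technical columns to glossary terms.
COLUMN_TERMS = {
    "orders.purchase_date": "purchase date",
    "orders.customer_id": "customer",
}


def explain(column):
    """Return the business meaning of a column, if a glossary term is assigned."""
    term = COLUMN_TERMS.get(column)
    if term is None:
        return f"{column}: no glossary term assigned"
    return f"{column}: {term} = {GLOSSARY[term]}"


print(explain("orders.purchase_date"))
print(explain("orders.shipped_at"))
```

The catalog would only say that orders.purchase_date holds a date. The glossary entry settles which date it is.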

Data Dictionary

A data dictionary is a searchable repository of the business or semantic metadata of data assets. It differs from a data catalog in that it also stores business or semantic information about the data. In practice, this difference is not very large, because many data catalogs can store semantic information as well. The terms “data dictionary” and “data catalog” are therefore often used interchangeably, which creates confusion about when to use each. The more useful distinction is the audience: technical audiences use data catalogs, whereas business audiences use data dictionaries.

Taxonomies

The data taxonomy is the oddball in this list. A taxonomy describes how to assign metadata to data. It provides a framework to name things and to disambiguate when there is confusion about the semantics. For example, is the purchase date when the first payment was received or the last? Taxonomies also standardize terms within the data.

As an example, consider two suppliers that both offer mechanical pencils. They may use different terms to describe the same features of a mechanical pencil.

A taxonomy provides a framework to standardize the values in the Color and Description columns.
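
One way to picture this standardization is a lookup from each supplier's wording to a single canonical value. The sketch below does this for a Color column. The canonical values, the synonyms, and the standardize helper are invented for illustration and are not taken from the original example.

```python
# Hypothetical taxonomy for a Color column: canonical value -> accepted variants.
COLOR_TAXONOMY = {
    "Black": {"black", "blk", "bk"},
    "Blue": {"blue", "blu"},
}


def standardize(value, taxonomy):
    """Map a supplier-specific value to its canonical term, or None if unknown."""
    cleaned = value.strip().lower()
    for canonical, variants in taxonomy.items():
        if cleaned in variants:
            return canonical
    return None  # unknown values are flagged for review rather than guessed


supplier_a = ["Black", "Blue"]
supplier_b = ["BLK", " blu "]

print([standardize(value, COLOR_TAXONOMY) for value in supplier_a])
print([standardize(value, COLOR_TAXONOMY) for value in supplier_b])
```

Returning None for unknown values is a deliberate choice in this sketch: it surfaces terms the taxonomy does not yet cover instead of silently mapping them.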

Conclusion

The terminology and names of the systems that manage metadata can be confusing. This post categorizes them based on what metadata they store and who uses it. If you found this post helpful, do drop a like and a comment and share the post with others!

