FAQs on Data Lineage

Rajat Venkatesh — 2/29/2020 — 1 Min Read

In biology, a lineage is a sequence of species, each considered to have evolved from its predecessor. Similarly, Data Lineage is the sequence of transformations that data passes through, via intermediary systems, on its way to a final dataset. Each dataset draws information from its predecessors, processed using a SQL query or a program in a language such as Python or Scala. Data Lineage can be captured at any level of granularity: schema, table, or column.

Why is data lineage important?

Data Lineage enables data governance functions such as:

Business Rules Verification
Change Impact Analysis
Data Quality Verification
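
As an illustration of Change Impact Analysis, here is a minimal sketch that treats table-level lineage as a directed graph and walks it to find every dataset downstream of a table that is about to change. The table names and the `downstream` helper are hypothetical, made up for this example; a real lineage tool would load the edges from its metadata store.

```python
from collections import defaultdict, deque

# Hypothetical table-level lineage edges: (source, target).
EDGES = [
    ("raw.orders", "staging.orders"),
    ("raw.customers", "staging.customers"),
    ("staging.orders", "analytics.revenue"),
    ("staging.customers", "analytics.revenue"),
    ("analytics.revenue", "reports.monthly_revenue"),
]


def downstream(edges, start):
    """Return every dataset reachable from `start` (breadth-first walk)."""
    children = defaultdict(list)
    for source, target in edges:
        children[source].append(target)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# Datasets that may be affected by a change to raw.orders.
print(downstream(EDGES, "raw.orders"))
# ['analytics.revenue', 'reports.monthly_revenue', 'staging.orders']
```

The same walk with the edges reversed would answer the opposite question: which upstream sources feed a given report.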

What is a data lineage tool?

A Data Lineage Tool captures metadata of all data transformations, organizes the metadata in a graph, and provides access through visual interfaces and programmable APIs. Generally, data lineage tools use two techniques:

Push: ETL platforms push metadata to a data lineage tool during transformations.
Pull: Data Lineage tools scan logs and query history from databases and data lakes and generate lineage after the event.

Some data lineage tools use both techniques.
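
The two techniques can be sketched with a toy lineage store. This is an illustration under stated assumptions, not how any particular tool works: the `LineageGraph` class, its `record` and `scan` methods, and the sample queries are all hypothetical, and the regular expressions only handle simple `INSERT INTO ... SELECT` statements. A real tool would use a proper SQL parser.

```python
import re

# Simplistic patterns; they ignore subqueries, CTEs, quoting, and other SQL forms.
TARGET_RE = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


class LineageGraph:
    """Hypothetical in-memory store of (source, target) lineage edges."""

    def __init__(self):
        self.edges = set()

    def record(self, source, target):
        """Push: an ETL job reports an edge while it runs."""
        self.edges.add((source, target))

    def scan(self, query_history):
        """Pull: derive edges after the event from logged SQL text."""
        for sql in query_history:
            target = TARGET_RE.search(sql)
            if target is None:
                continue  # plain reads do not create a new dataset
            for source in SOURCE_RE.findall(sql):
                self.edges.add((source, target.group(1)))


# Hypothetical query history, as it might be read from a database log.
HISTORY = [
    "INSERT INTO analytics.revenue "
    "SELECT c.region, SUM(o.amount) FROM staging.orders o "
    "JOIN staging.customers c ON o.customer_id = c.id GROUP BY c.region",
    "SELECT * FROM analytics.revenue",
]

graph = LineageGraph()
graph.record("raw.orders", "staging.orders")  # pushed by an ETL job
graph.scan(HISTORY)                           # pulled from query history
for edge in sorted(graph.edges):
    print(edge)
```

Both paths write into the same edge set, which is why a tool can combine them: push covers jobs that run outside the database, while pull covers queries that no ETL platform reported.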

Are there open-source data catalog tools?

How do you build data lineage solutions for databases?

Choose one of the open-source data catalog projects such as Amundsen, Apache Atlas, or Data Lineage.
Follow the installation instructions of the project. Some require a Hadoop cluster.
Integrate ETL tools, databases, and data engines with the data lineage tool.
