FAQs on Data Lineage

Rajat Venkatesh — 02/29/2020, 1 Min Read — In Data Lineage


What is meant by data lineage?

In biology, a lineage is a sequence of species, each of which is considered to have evolved from its predecessor.

Similarly, data lineage is a sequence of transformations through intermediary systems to a final dataset. Each dataset is considered to have been created from its predecessor through a specific transformation. A transformation may be a SQL query or a program in a language such as Python or Scala. Data lineage can be captured at any level of granularity: schema, table or column.
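As a minimal sketch of this idea, the chain of transformations can be modeled as a list of records, each linking a predecessor dataset to the dataset created from it. The class `Transformation`, the function `trace_back` and all dataset names are hypothetical and chosen only for illustration. The sketch assumes each dataset has a single predecessor and that the chain contains no cycles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transformation:
    source: str  # predecessor dataset (a table here; could be a schema or column)
    target: str  # dataset created by the transformation
    logic: str   # SQL text or name of the program that ran


# Hypothetical pipeline: raw events -> cleaned events -> daily report
transformations = [
    Transformation(
        "raw.events",
        "staging.events_clean",
        "INSERT INTO staging.events_clean SELECT * FROM raw.events WHERE user_id IS NOT NULL",
    ),
    Transformation("staging.events_clean", "reports.daily_events", "daily_events.py"),
]


def trace_back(dataset, transformations):
    """Return the lineage of `dataset`, oldest predecessor first.

    Assumes at most one predecessor per dataset and no cycles.
    """
    produced_by = {t.target: t for t in transformations}
    chain = [dataset]
    while chain[-1] in produced_by:
        chain.append(produced_by[chain[-1]].source)
    return list(reversed(chain))


lineage = trace_back("reports.daily_events", transformations)
print(" -> ".join(lineage))
```

Column-level lineage could use the same structure with names such as `raw.events.user_id` in place of table names.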

Why is data lineage important?

Data lineage matters because it enables key data governance functions such as:

  • Business Rules Verification
  • Change Impact Analysis
  • Data Quality Verification
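Change impact analysis, for example, amounts to walking the lineage graph downstream from the dataset that is about to change. The following is a rough sketch under that assumption. The `downstream` mapping, the function `impacted_by` and the dataset names are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: dataset -> datasets derived directly from it
downstream = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["reports.revenue", "reports.refunds"],
    "reports.revenue": ["dashboards.finance"],
}


def impacted_by(dataset, downstream):
    """Breadth-first walk collecting every dataset derived,
    directly or indirectly, from `dataset`."""
    seen = set()
    order = []
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = impacted_by("staging.orders", downstream)
print(impacted)
```

Running the same walk in the upstream direction would support data quality verification, since it lists every dataset a suspect value could have come from.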

What is a data lineage tool?

A data lineage tool captures metadata about all data transformations, organizes that metadata in a graph, and provides access to the graph through visual interfaces and programmable APIs.

In general, data lineage tools use two techniques:

  • Push: ETL platforms push metadata to a data lineage tool during transformations.
  • Pull: Data lineage tools scan logs and query history from databases and data lakes and generate lineage after the fact.

Some data lineage tools use both techniques.
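The pull technique can be illustrated with a deliberately simplified sketch that scans a query history and derives (source, target) edges from it. The regular expressions below are a stand-in for a real SQL parser and only handle simple `INSERT INTO ... SELECT` and `CREATE TABLE ... AS SELECT` statements. The query history, `extract_edges` and the table names are all hypothetical.

```python
import re

# Hypothetical query history, as a database log might record it
query_history = [
    "INSERT INTO staging.orders SELECT * FROM raw.orders WHERE status <> 'test'",
    "SELECT count(*) FROM staging.orders",
    "CREATE TABLE reports.revenue AS SELECT o.day, sum(o.amount) "
    "FROM staging.orders o JOIN raw.currencies c ON o.currency = c.code "
    "GROUP BY o.day",
]

# Table written by the statement, if any
TARGET = re.compile(r"^\s*(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", re.IGNORECASE)
# Tables read by the statement
SOURCES = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_edges(queries):
    """Return (source, target) lineage edges found in the query history."""
    edges = []
    for sql in queries:
        target = TARGET.match(sql)
        if target is None:
            continue  # plain SELECTs write nothing, so they add no lineage
        for source in SOURCES.findall(sql):
            edges.append((source, target.group(1)))
    return edges


edges = extract_edges(query_history)
for source, target in edges:
    print(f"{source} -> {target}")
```

In a push setup, the ETL job itself would report the same (source, target) pairs at the time it runs, so no log scanning is needed.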

Are there open source data catalog tools?

  1. Amundsen by Lyft
  2. Metacat by Netflix
  3. Tokern Lineage by Tokern

How do you build a data lineage solution for databases?

  • Choose one of the open source data catalog projects, such as Amundsen, Apache Atlas (https://atlas.apache.org/) or Tokern Lineage.
  • Follow the project's installation instructions. Some projects require a Hadoop cluster.
  • Integrate ETL tools, databases and data engines with the data lineage tool.