FAQs on Data Lineage
In biology, a lineage is a sequence of species, each of which is considered to have evolved from its predecessor.
Similarly, Data Lineage is a sequence of transformations through intermediary systems to a final dataset. Each dataset is considered to have been created from its predecessor through a specific transformation. A transformation may be a SQL query or a program in a language such as Python or Scala. Data Lineage can be captured at any level of granularity: schema, table or column.
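As a rough illustration of this definition, the sketch below models table-level lineage as a list of transformation records and walks back from a final dataset to its origin. The `Transformation` class, the `trace_back` helper and all table names are hypothetical, chosen only for this example; it also assumes each dataset has a single predecessor and that the lineage contains no cycles.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Transformation:
    """One lineage step: `target` is created from `sources` by `logic`."""
    sources: tuple[str, ...]
    target: str
    logic: str  # a SQL query, or the name of a Python/Scala program


# Hypothetical lineage: raw.orders -> staging.orders -> reports.daily_revenue
steps = [
    Transformation(
        ("raw.orders",),
        "staging.orders",
        "INSERT INTO staging.orders SELECT * FROM raw.orders WHERE status = 'paid'",
    ),
    Transformation(("staging.orders",), "reports.daily_revenue", "daily_revenue.py"),
]


def trace_back(dataset: str, steps: list[Transformation]) -> list[str]:
    """Follow predecessors from `dataset` to its origin.

    Assumption: every transformation has exactly one source and the
    lineage is acyclic; only the first source is followed.
    """
    by_target = {step.target: step for step in steps}
    chain = [dataset]
    while chain[-1] in by_target:
        chain.append(by_target[chain[-1]].sources[0])
    return chain


print(trace_back("reports.daily_revenue", steps))
# ['reports.daily_revenue', 'staging.orders', 'raw.orders']
```

The same structure works at schema or column level by changing what the strings identify, for example `staging.orders.amount` instead of `staging.orders`.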
Data Lineage matters because it enables data governance functions such as:
- Business Rules Verification
- Change Impact Analysis
- Data Quality Verification
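Change impact analysis, for instance, amounts to finding everything downstream of the item being changed. The following is a minimal sketch of that idea using a breadth-first traversal over column-level lineage edges; the `impacted_by` function and the column names are made up for illustration and do not come from any particular tool.

```python
from collections import deque

# Hypothetical column-level lineage edges: (upstream, downstream)
edges = [
    ("raw.orders.amount", "staging.orders.amount"),
    ("staging.orders.amount", "reports.daily_revenue.total"),
    ("staging.orders.amount", "reports.customer_ltv.ltv"),
    ("raw.orders.status", "staging.orders.status"),
]


def impacted_by(changed, edges):
    """Return every column reachable downstream of `changed`, sorted by name."""
    downstream = {}
    for src, dst in edges:
        downstream.setdefault(src, []).append(dst)

    seen = set()
    queue = deque([changed])
    while queue:
        for nxt in downstream.get(queue.popleft(), []):
            if nxt not in seen:  # also guards against revisiting shared nodes
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)


print(impacted_by("raw.orders.amount", edges))
# ['reports.customer_ltv.ltv', 'reports.daily_revenue.total', 'staging.orders.amount']
```

Reversing the direction of the edges gives the upstream view, which is the one typically needed to verify a business rule or trace a data quality problem back to its source.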
A Data Lineage Tool captures metadata of all data transformations, organizes the metadata in a graph and provides access to the graph through visual interfaces and programmable APIs.
In general, data lineage tools use two techniques to capture this metadata:
- Push: ETL platforms push metadata to a data lineage tool during transformations.
- Pull: Data Lineage tools scan logs and query history from databases and data lakes and generate lineage after the event.
Some data lineage tools use both techniques.
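To make the pull technique concrete, here is a deliberately naive sketch that derives table-level lineage edges from a list of SQL statements, as might be read from a query history. The regular expressions only handle simple `INSERT INTO ... SELECT ... FROM/JOIN` statements; a real tool would presumably rely on a proper SQL parser. The function name and the sample queries are assumptions for this example.

```python
import re

# Naive patterns: sufficient for simple statements only, not a SQL parser.
TARGET = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def lineage_from_history(queries):
    """Return sorted (source_table, target_table) edges found in `queries`."""
    edges = set()
    for query in queries:
        target = TARGET.search(query)
        if target is None:
            continue  # plain SELECTs write nothing, so they add no lineage
        for source in SOURCE.findall(query):
            edges.add((source, target.group(1)))
    return sorted(edges)


# Hypothetical query history, e.g. as read from a database's query log
query_history = [
    "INSERT INTO staging.orders SELECT * FROM raw.orders WHERE status = 'paid'",
    "INSERT INTO reports.daily_revenue "
    "SELECT o.day, SUM(o.amount) FROM staging.orders o "
    "JOIN dim.calendar c ON o.day = c.day GROUP BY o.day",
    "SELECT COUNT(*) FROM reports.daily_revenue",
]

for source, target in lineage_from_history(query_history):
    print(f"{source} -> {target}")
# dim.calendar -> reports.daily_revenue
# raw.orders -> staging.orders
# staging.orders -> reports.daily_revenue
```

With the push technique, the ETL platform would report the same kind of edges itself while the transformation runs, so no parsing after the event is needed.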
- Choose one of the open source data catalog projects such as Amundsen, [Apache Atlas](https://atlas.apache.org/) or Lineage.
- Follow the project's installation instructions. Some projects require a Hadoop cluster.
- Integrate ETL tools, databases and data engines with the data lineage tool.