Release data-lineage - An Open Source Python Project

Rajat Venkatesh — 03/15/2020, 1 Min Read — In Data Lineage

Today we released data-lineage, an open source Python project to visualize and analyze data lineage. The project was developed in collaboration with data teams on data governance initiatives over the last couple of years.

There are a lot of open source and commercial tools to capture data lineage. However, data engineers run into two main problems with them:

  • The tools require a lot of effort to set up and maintain.
  • The tools require constant discipline in capturing and sending all the metadata.

Both these factors result in incomplete projects and lost opportunities in improving performance, ROI and data quality.

data-lineage addresses these problems by focusing on the following goals:

  • providing fast access to data lineage
  • simple setup
  • analysis of the lineage using a graph library

To achieve these goals, data-lineage has the following features:

  1. Generate data lineage from query history. Most databases maintain query history for a few days, so no separate infrastructure is needed to capture and store metadata and setup costs are minimal.
  2. Use the networkx graph library to create a DAG of the lineage. Networkx graphs provide programmatic access to data lineage, which opens up rich opportunities to analyze it.
  3. Use Plotly to visualize the graph. Plotly supports tooltips, color coding and weights based on different attributes of the graph.
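
To make the first two features concrete, here is a minimal sketch of the general idea. It is not the data-lineage API. The query history, the table names and the `build_lineage` helper are all made up for illustration, and the regular expressions are a crude stand-in for a real SQL parser. It only shows how source and target tables pulled from queries can become a networkx graph that you can query.

```python
import re

import networkx as nx

# Hypothetical query history. In practice this would be read from the
# database's own query log rather than hard-coded.
QUERY_HISTORY = [
    "INSERT INTO staging_orders SELECT * FROM raw_orders",
    "INSERT INTO daily_revenue SELECT day, sum(amount) "
    "FROM staging_orders JOIN customers "
    "ON staging_orders.customer_id = customers.id GROUP BY day",
    "INSERT INTO revenue_report SELECT * FROM daily_revenue",
]

# Simplified patterns (an assumption): real SQL needs a proper parser.
TARGET_RE = re.compile(r"\binsert\s+into\s+(\w+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+(\w+)", re.IGNORECASE)


def build_lineage(queries):
    """Build a directed graph with one edge per (source table, target table)."""
    graph = nx.DiGraph()
    for query in queries:
        target = TARGET_RE.search(query)
        if target is None:
            continue  # not a query that writes to a table
        for source in SOURCE_RE.findall(query):
            # Keep the query text on the edge so it can be inspected later.
            graph.add_edge(source, target.group(1), query=query)
    return graph


lineage = build_lineage(QUERY_HISTORY)

# Which tables feed revenue_report, directly or indirectly?
upstream = nx.ancestors(lineage, "revenue_report")

# Which tables are affected if raw_orders changes?
downstream = nx.descendants(lineage, "raw_orders")

# Lineage is expected to be acyclic; networkx can check that.
is_dag = nx.is_directed_acyclic_graph(lineage)
```

Once the lineage is a graph, questions such as "what is upstream of this report?" or "what breaks if this table changes?" become one-line graph queries.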

You can get a data lineage graph with fewer than 10 lines of Python code in a Jupyter Notebook.

Right now, data-lineage supports Postgres, and support for more databases is planned.

Please give it a try if you need data lineage for your work and provide feedback.