Release data-lineage - An Open Source Python Project

Rajat Venkatesh · 3/15/2020 · 1 Min Read

Today we released data-lineage, an open-source Python project to visualize and analyze data lineage. Developing this project required collaboration with data teams working on various data governance initiatives over the last couple of years.

There are many open-source and commercial tools to capture data lineage. However, they pose two main problems for data engineers:

  • The projects require a lot of effort to set up and maintain.
  • They require constant discipline in capturing and sending all the metadata.

Both of these factors result in incomplete projects and lost opportunities to improve performance, ROI, and data quality. data-lineage addresses these problems by pursuing the following goals:

  • fast access to data lineage
  • simple setup
  • analysis of data lineage using a graph library

The following features help to achieve these goals:

  • Generate data lineage from query history. Most databases already maintain query history for a few days, so the cost of setting up infrastructure to capture and store metadata is minimal.
  • Use the NetworkX graph library to create a DAG of the lineage. NetworkX graphs give programmatic access to data lineage, which opens up rich opportunities for analysis.
  • Use Plotly to visualize the graph. Plotly provides a number of features to create informative visualizations, including tooltips, color coding, and weights based on different attributes.
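
To make the first two features concrete, here is a minimal sketch of the idea, not data-lineage's actual implementation or API. It assumes a hypothetical in-memory query history and uses two deliberately naive regular expressions to find source and target tables (a real tool would use a proper SQL parser). The table names and the `lineage_edges` helper are made up for illustration. Only the NetworkX calls are real library functions.

```python
import re

import networkx as nx

# Hypothetical query history. In practice this would be read from the
# database's own query log rather than hard-coded.
QUERY_HISTORY = [
    "INSERT INTO staging_orders SELECT * FROM raw_orders",
    "INSERT INTO daily_revenue SELECT order_date, SUM(amount) "
    "FROM staging_orders GROUP BY order_date",
    "INSERT INTO revenue_report SELECT * FROM daily_revenue "
    "JOIN dim_dates ON daily_revenue.order_date = dim_dates.day",
]

# Naive patterns for illustration only; they ignore schemas, quoting,
# subqueries, CTEs, and every statement type other than INSERT ... SELECT.
TARGET_RE = re.compile(r"INSERT\s+INTO\s+(\w+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"(?:FROM|JOIN)\s+(\w+)", re.IGNORECASE)


def lineage_edges(queries):
    """Yield (source_table, target_table) pairs found in each query."""
    for query in queries:
        target = TARGET_RE.search(query)
        if target is None:
            continue  # not a statement that writes to a table
        for source in SOURCE_RE.findall(query):
            yield source, target.group(1)


# Each edge points from an upstream table to the table derived from it.
graph = nx.DiGraph()
graph.add_edges_from(lineage_edges(QUERY_HISTORY))

# Lineage should form a DAG; a cycle would indicate a parsing problem
# or a table that is rebuilt from its own downstream tables.
is_dag = nx.is_directed_acyclic_graph(graph)

# Everything revenue_report depends on, directly or indirectly.
upstream = sorted(nx.ancestors(graph, "revenue_report"))

# Everything affected if raw_orders changes.
downstream = sorted(nx.descendants(graph, "raw_orders"))
```

Once the lineage is a graph, questions such as "what feeds this report?" or "what breaks if this table changes?" become single library calls, which is the main reason for building on a graph library instead of a custom structure.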

You can get a data lineage graph with fewer than ten lines of Python code in a Jupyter Notebook. Currently, data-lineage supports PostgreSQL, with support for additional databases on the way. Try it out if you need data lineage for your work, and send us your feedback!

