Data Lineage on Redshift
Data lineage tracks how data is transformed as it moves through systems. It is important for data governance and security. In data warehouses and data lakes, a team of data engineers maintains a canonical set of base tables. The data in these base tables may be events from the product, logs, or third-party data from Salesforce, Google Analytics, or Segment.
Data analysts and scientists use the base tables for their reports and machine learning models. They may create derived tables to help with their work. It is not uncommon for a data warehouse or lake to have hundreds or thousands of tables. At that scale, automation and visual tools are essential to track data lineage.
This post describes automated visualization of data lineage in AWS Redshift using the data warehouse's query logs. The techniques apply to other technologies as well.
A system of record of all activity in databases is a prerequisite for any type of analysis. For example, AWS Redshift has many system tables and views that record all the activity in the database. Since these tables retain data for a limited time, it is important to persist the data. AWS provides scripts to store the data in tables within Redshift itself. For performance analysis, the query logs stored in STL_QUERY and STL_QUERYTEXT are the most important.
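Persisting the logs can be as simple as a scheduled INSERT ... SELECT from the system table into a longer-lived history table. A minimal sketch is below; the `query_history` table name and the lookback window are assumptions for illustration, not part of the AWS-provided scripts.

```python
# Sketch: archive Redshift query logs into a persistent history table.
# The table name `query_history` and the default 2-day lookback are
# assumptions; adjust to your retention needs.

def build_archive_sql(lookback_days: int = 2) -> str:
    """Return SQL that copies recent STL_QUERY rows into query_history,
    skipping query IDs that were already archived."""
    return f"""
        INSERT INTO query_history
        SELECT q.*
        FROM stl_query q
        WHERE q.starttime >= DATEADD(day, -{lookback_days}, GETDATE())
          AND q.query NOT IN (SELECT query FROM query_history);
    """

if __name__ == "__main__":
    print(build_archive_sql())
```

The same pattern applies to STL_QUERYTEXT; running the statement on a schedule (e.g. a daily cron job) keeps the history table complete even though Redshift purges the system tables.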
Tokern reads and processes the records in STL_QUERY and STL_QUERYTEXT at regular intervals. It adds the following information for every query:
- Type of query, such as SELECT, DML, DDL, COPY, or UNLOAD.
- Source tables and columns read by the query, if applicable.
- Source files in S3 for COPY queries.
- Target table and columns where the data was loaded, if applicable.
- Target files in S3 for UNLOAD queries.
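To make the classification step concrete, here is a deliberately simplified sketch; it is not Tokern's implementation, which would use a full SQL parser rather than keyword matching and regexes.

```python
import re

def classify_query(sql: str) -> str:
    """Roughly classify a query by its leading keyword.
    A simplification: real lineage tools parse the full statement."""
    first = sql.lstrip().split(None, 1)[0].upper()
    if first in ("INSERT", "UPDATE", "DELETE"):
        return "DML"
    if first in ("CREATE", "DROP", "ALTER"):
        return "DDL"
    if first in ("SELECT", "COPY", "UNLOAD"):
        return first
    return "OTHER"

def source_tables(sql: str) -> list:
    """Pull table names following FROM/JOIN with a regex -- good enough
    for a sketch, not for arbitrary SQL (subqueries, CTEs, quoting)."""
    return re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
```

For example, `source_tables("SELECT a.x FROM a JOIN b ON a.id = b.id")` returns `["a", "b"]`, which is the raw material for the lineage edges described next.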
Tokern uses the system of record to build a network graph for every table and pipeline. An example for infallible_galois is visualized below.
In the network graph, data moves from left to right. Every node (or circle) represents a table. An edge points from a source table to a target table whose data load reads from it. A table can be loaded with data from many tables. For example, the data load for hopeful_matsumoto reads data from hungry_margulis. The graph can be analyzed programmatically or used to create interactive charts that help data engineers glean actionable information.
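Analyzing such a graph programmatically amounts to a reachability query: the full upstream lineage of a table is everything reachable by following source edges. A minimal sketch with a plain adjacency dict (the table names echo the post's examples and are purely illustrative):

```python
# Sketch: lineage as a dict mapping each table to the tables its load
# reads from. Table names are illustrative, echoing the post's examples.
lineage = {
    "hopeful_matsumoto": ["hungry_margulis"],
    "hungry_margulis": ["base_events"],
    "base_events": [],
}

def upstream(table: str, graph: dict) -> set:
    """All tables that data flows from, transitively, into `table`."""
    seen = set()
    stack = list(graph.get(table, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(graph.get(t, []))
    return seen

print(sorted(upstream("hopeful_matsumoto", lineage)))
# -> ['base_events', 'hungry_margulis']
```

The same traversal run in the opposite direction (over reversed edges) answers the impact question: which downstream tables are affected if a base table changes.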
The security engineers at Company S used a scanning tool to find all tables and columns that stored sensitive data such as PII, financial data, and business-sensitive data. They found many more copies than anticipated. An immediate goal was to reduce the number of copies and thereby shrink the security vulnerability surface area. To achieve this, they had to:
- Identify the owners of each copy.
- Understand the owners' motivation for creating copies.
- Work with the owners to eliminate copies or replace them with masked copies of the data.
The challenges in achieving these goals were:
- Determining the owners of the data.
- The number of conversations required to understand the need for copies and agree on a workaround.
Due to the sheer number of tables, the task was daunting and required an automation tool.
Tokern provided the security team with actionable information to discover:
- The owners of copies of data.
- A profile consisting of the queries used to create each table and read from it.
The profile helped the security engineers prepare for the conversation with the owners. On a number of occasions, duplicate tables were eliminated. In other cases, the workflow was changed to eliminate temporary tables holding intermediate data.
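One heuristic for surfacing such candidates from a lineage graph: a table that feeds exactly one downstream load is often an intermediate that could be inlined into that load. A sketch of this idea (a heuristic to guide conversations, not a guarantee; the table names are illustrative):

```python
# Sketch: flag single-reader tables as candidates for elimination.
# The lineage dict maps each table to the tables its load reads from;
# table names here are illustrative.

def intermediate_candidates(lineage: dict) -> list:
    """Return tables read by exactly one downstream load. Such tables
    are candidates for inlining into that load."""
    readers = {}
    for table, sources in lineage.items():
        for src in sources:
            readers.setdefault(src, []).append(table)
    return sorted(t for t, r in readers.items() if len(r) == 1)

example = {
    "report": ["staging"],        # staging feeds only the report
    "staging": ["base_events"],
    "audit": ["base_events"],     # base_events has two readers
}
print(intermediate_candidates(example))
# -> ['staging']
```

Each flagged table still needs the owner conversation described above: a single-reader table may exist for a legitimate reason, such as checkpointing a long-running pipeline.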
In the example below for upbeat_ellis, two intermediate tables with sensitive data were eliminated from the workflow.
Another example of a network graph, for crazy_terseshkova, is shown below. As seen, the data path is much more complicated, with close to a hundred tasks required to eventually load data into the table. To help deal with such complicated pipelines, the graph is interactive: it can be panned and zoomed to focus on specific parts, and hovering over a node shows task metadata such as start time, run time, and table name.
Security engineers need data lineage and related automation tools to manage database security. All databases provide a system of record of their workload. Tools such as the Tokern Database Security Platform can use this information to visualize data lineage and apply rules to automate checks.
If such a tool is useful to you, get in touch with us through the chat widget.