Download Queries from a Data Warehouse

Overview

In query parsing recipe, the queries were read from a JSON file. This recipe shows how to download queries from AWS Redshift and Snowflake. Once these queries are downloaded, they can be parsed and lineage created using the query parsing recipe.

AWS Redshift

Amazon Redshift has many system tables and views that contain information about how the system is functioning.

STL_QUERYTEXT captures the query text of SQL commands.

STL_QUERY returns execution information about a database query.

The following SQL queries are captured in these two tables.

SELECT, SELECT INTO
INSERT, UPDATE, DELETE
COPY
UNLOAD
VACUUM, ANALYZE
CREATE TABLE AS (CTAS)

STL_QUERY stores a part of the query text. To reconstruct the SQL stored in the text column of STL_QUERYTEXT, run a SELECT statement to create SQL from 1 or more parts in the text column. Before running the reconstructed SQL, replace any (\n) special characters with a new line. The result of the following SELECT statement is rows of reconstructed SQL in the query_statement field.

Use the following query to capture queries for a specific time range.

WITH query_sql AS (
  SELECT
    query,
    LISTAGG(text) WITHIN GROUP (ORDER BY sequence) AS sql
  FROM stl_querytext
  GROUP BY 1
)

SELECT
  q.query,  userid,  xid,  pid,  starttime,  endtime,
    DATEDIFF(milliseconds, starttime, endtime)/1000.0 AS duration,
    TRIM(database) AS database,
  (CASE aborted WHEN 1 THEN TRUE ELSE FALSE END) AS aborted,
  sql
FROM
  stl_query q JOIN query_sql qs ON (q.query = qs.query)
WHERE
  endtime between '2018-12-01 00:00:00' and '2018-12-14 00:00:00'
ORDER BY starttime;

Store query results in a CSV file

Store the results of the above query using psql. Run the command below where query.sql contains the query in the previous section.

PGPASSWORD=<password> psql -h <host> -U <user> -p 5439 \
    -f query.sql -F ',' -o queries.csv

Pro-tip: Persist System Tables in AWS Redshift

System tables in AWS Redshift store data only for a few days. Persist the these tables to normal tables in the same instance or in another database. AWSLabs provides sample scripts to persist system tables in AWS Redshift

Snowflake

Snowflake stores queries in the information schema in QUERY_HISTORY table.

The following query returns queries run in the last 7 days

select *
  from table(information_schema.query_history(
    end_time_range_start=>to_timestamp_ltz('2017-12-4 12:00:00.000 -0700'),
    end_time_range_end=>to_timestamp_ltz('2017-12-4 12:30:00.000 -0700')));

Python Connector

In a Jupyter Notebook, use the snowflake-python-connector to connect to snowflake and store the queries in an array. Then use the data-lineage library to visualize or process the lineage.

Steps

Install Snowflake Python Connector
Connect to Snowflake using one of the authentication methods.
Run query and store results in an array.

import snowflake.connector

query = """
select query_text
  from table(information_schema.query_history(
    end_time_range_start=>to_timestamp_ltz('2017-12-4 12:00:00.000 -0700'),
    end_time_range_end=>to_timestamp_ltz('2017-12-4 12:30:00.000 -0700')));
"""

connection = snowflake.connector.connect(...)

queries = []
with connection.get_cursor() as cursor:
  cursor.execute(query)
  row = cursor.fetchone()

  while row is not None:
    queries.append(row[0])

Overview​

AWS Redshift​

Store query results in a CSV file​

Pro-tip: Persist System Tables in AWS Redshift​

Snowflake​

Python Connector​

Steps​