Open-Source SQL Parsers

Rajat Venkatesh — 10/1/2021 — 3 Min Read

Parsing SQL queries provide superpowers for monitoring data health. This post elaborates on how to get started with parsing SQL for data observability.

The query history of a data warehouse is a rich source of information to analyze data usage within your organization. Assessing query history can track many aspects of data observability. For instance, carrying out such analysis can help to understand the following:

Popular tables and columns

Unused tables and columns

Column-level lineage

Freshness of data

These statistics also help to automate engineering tasks like:

Backup and Disaster Recovery

Triaging Data Quality issues

Tracking sensitive data and its use

Challenges and Approaches

SQL 2016 meets ISO/IEC standards. However, every database implements the standard differently, using different function names for similar operations, with unique extensions to access specific custom features. Therefore, there cannot be a singular SQL parser for every popular database or data warehouse.

Regular expressions are a popular approach to extracting information from SQL statements. However, these quickly become too complex to handle features like WITH, sub-queries, windows clauses, aliases, and quotes. sqlparse is a popular Python package that uses regular expressions to parse SQL.

Alternatively, parser/optimizer platforms like ANTLR or Apache Calcite help to reduce the effort to implement the SQL dialect of your choice.

Open Source Parsers

Some popular open-source databases and data warehouses are:

MySQL/MariaDB

Pingcap parser is a MySQL parser in Go.
SQL Parser in phpmyadmin is a validating SQL lexer and parser focused on MySQL dialect.

Postgres

libpg_query extracts the parser (written in C) from the postgres project and packages it as a stand-alone library. This library is wrapped in other languages by other projects like:

Python: pglast
Ruby: pg_query
Golang: pg_query_go
JS: psql-parser in Node and pg-query-emscripten in the browser
Rust: pg_query.rs

Multiple Engines

queryparser implements Apache Hive, Presto/Trino, and Vertica dialects.
zetasql implements BigQuery, Spanner, and Dataflow dialects.

Generic Parsers

Python: sqlparse
Rust: sqlparser-rs
Python: mo-sql-parsing

Platforms

Parser/Optimizer platforms implement the SQL language features and allow customization as a first-class feature. Two popular open-source projects are:

Apache Calcite: a parser/optimizer used with popular databases and query engines like Apache Hive, BlazingSQL, and many others.
JSQLParser can parse multiple SQL dialects like MySQL, Postgres, and Oracle. Modifications to the grammer will enable support for other SQL dialects.

Apache Calcite allows customizations at various points of the parsing process.

Parser rules can be changed to support custom syntax.
Conventions such as quotes versus double quotes and case sensitivity.
Add optimizer rules.

Apache Calcite also provides visitors for traversing the SQL execution plan. A Visitor pattern is an algorithm to traverse a SQL plan.

Practical tips for Getting Started

There are many abandoned open-source SQL parsers. The first step is to use a project that will be supported in the future. For popular databases such as Postgres and MySQL/MariaDB, there are parsers available in multiple programming languages.

What if there is no parser for your database?

Most teams do not create a parser from scratch. A popular option is to use the Postgres parser and then add custom SQL syntax. AWS Redshift, Vertica, and DuckDB are examples. Use a Postgres SQL parser to parse the query history of these databases to parse the majority of the queries. Many queries, such as UNLOAD, will fail to parse in AWS Redshift. If parsing the variants is also required in your context, consider modifying the projects to accept the custom grammar OR use a platform like Apache Calcite.

Conclusion

There is a demand for SQL parsers to build reports on databases or data warehouse usage. There are several good open-source projects. However, there is a steep learning curve to use these projects, and a project may not always fit your specific requirements. Need help with parsing query history? Get in touch

Ready to dive in?

Why do we need Data Governance? What are its Challenges?

Rajat Venkatesh — 7/10/2020 - 1 Min Read

The first step is to understand what data governance is. Data Governance is an overloaded term and means different things to different people. It has been helpful to define Data Governance based on the outcomes it is supposed to deliver.

An Analysis of Data Security Incidents

Rajat Venkatesh — 1/3/2020 - 1 Min Read

Tokern analyzed the VCDB security incidents data set with a focus on insider threats and database breaches.

Data Lineage on Redshift

Rajat Venkatesh — 1/17/2023 - 3 Min Read

Use Tokern Lineage to visualize data lineage of data in Redshift. It is open source and simple but effective.

Get in touch for bespoke support for PII Catcher

We can help discover, manage and secure sensitive data in your data warehouse.