Open-Source SQL Parsers

Rajat Venkatesh10/1/2021 3 Min Read

Parsing SQL queries provide superpowers for monitoring data health. This post elaborates on how to get started with parsing SQL for data observability.

The query history of a data warehouse is a rich source of information to analyze data usage within your organization. Assessing query history can track many aspects of data observability. For instance, carrying out such analysis can help to understand the following:

  • Popular tables and columns
  • Unused tables and columns
  • Column-level lineage
  • Freshness of data

These statistics also help to automate engineering tasks like:

  • Backup and Disaster Recovery
  • Triaging Data Quality issues
  • Tracking sensitive data and its use



Challenges and Approaches

SQL 2016 meets ISO/IEC standards. However, every database implements the standard differently, using different function names for similar operations, with unique extensions to access specific custom features. Therefore, there cannot be a singular SQL parser for every popular database or data warehouse.

Regular expressions are a popular approach to extracting information from SQL statements. However, these quickly become too complex to handle features like WITH, sub-queries, windows clauses, aliases, and quotes. sqlparse is a popular Python package that uses regular expressions to parse SQL.

Alternatively, parser/optimizer platforms like ANTLR or Apache Calcite help to reduce the effort to implement the SQL dialect of your choice.



Open Source Parsers

Some popular open-source databases and data warehouses are:

MySQL/MariaDB

  • Pingcap parser is a MySQL parser in Go.
  • SQL Parser in phpmyadmin is a validating SQL lexer and parser focused on MySQL dialect.

Postgres

libpg_query extracts the parser (written in C) from the postgres project and packages it as a stand-alone library. This library is wrapped in other languages by other projects like:

Multiple Engines

  • queryparser implements Apache Hive, Presto/Trino, and Vertica dialects.
  • zetasql implements BigQuery, Spanner, and Dataflow dialects.



Generic Parsers



Platforms

Parser/Optimizer platforms implement the SQL language features and allow customization as a first-class feature. Two popular open-source projects are:

  • Apache Calcite: a parser/optimizer used with popular databases and query engines like Apache Hive, BlazingSQL, and many others.
  • JSQLParser can parse multiple SQL dialects like MySQL, Postgres, and Oracle. Modifications to the grammer will enable support for other SQL dialects.

Apache Calcite allows customizations at various points of the parsing process.

  • Parser rules can be changed to support custom syntax.
  • Conventions such as quotes versus double quotes and case sensitivity.
  • Add optimizer rules.

Apache Calcite also provides visitors for traversing the SQL execution plan. A Visitor pattern is an algorithm to traverse a SQL plan.



Practical tips for Getting Started

There are many abandoned open-source SQL parsers. The first step is to use a project that will be supported in the future. For popular databases such as Postgres and MySQL/MariaDB, there are parsers available in multiple programming languages.

What if there is no parser for your database?

Most teams do not create a parser from scratch. A popular option is to use the Postgres parser and then add custom SQL syntax. AWS Redshift, Vertica, and DuckDB are examples. Use a Postgres SQL parser to parse the query history of these databases to parse the majority of the queries. Many queries, such as UNLOAD, will fail to parse in AWS Redshift. If parsing the variants is also required in your context, consider modifying the projects to accept the custom grammar OR use a platform like Apache Calcite.



Conclusion

There is a demand for SQL parsers to build reports on databases or data warehouse usage. There are several good open-source projects. However, there is a steep learning curve to use these projects, and a project may not always fit your specific requirements. Need help with parsing query history? Get in touch


Similar Posts

Get in touch for bespoke support for PII Catcher

We can help discover, manage and secure sensitive data in your data warehouse.