Parsing SQL queries provide superpowers for monitoring data health. This post elaborates on how to get started with parsing SQL for data observability.
The query history of a data warehouse is a rich source of information to analyze data usage within your organization. Assessing query history can track many aspects of data observability. For instance, carrying out such analysis can help to understand the following:
These statistics also help to automate engineering tasks like:
SQL 2016 meets ISO/IEC standards. However, every database implements the standard differently, using different function names for similar operations, with unique extensions to access specific custom features. Therefore, there cannot be a singular SQL parser for every popular database or data warehouse.
Regular expressions are a popular approach to extracting information from SQL statements. However, these quickly become too complex to handle features like WITH, sub-queries, windows clauses, aliases, and quotes. sqlparse is a popular Python package that uses regular expressions to parse SQL.
Alternatively, parser/optimizer platforms like ANTLR or Apache Calcite help to reduce the effort to implement the SQL dialect of your choice.
Some popular open-source databases and data warehouses are:
libpg_query extracts the parser (written in C) from the postgres project and packages it as a stand-alone library. This library is wrapped in other languages by other projects like:
Parser/Optimizer platforms implement the SQL language features and allow customization as a first-class feature. Two popular open-source projects are:
Apache Calcite allows customizations at various points of the parsing process.
Apache Calcite also provides visitors for traversing the SQL execution plan. A Visitor pattern is an algorithm to traverse a SQL plan.
There are many abandoned open-source SQL parsers. The first step is to use a project that will be supported in the future. For popular databases such as Postgres and MySQL/MariaDB, there are parsers available in multiple programming languages.
Most teams do not create a parser from scratch. A popular option is to use the Postgres parser and then add custom SQL syntax. AWS Redshift, Vertica, and DuckDB are examples. Use a Postgres SQL parser to parse the query history of these databases to parse the majority of the queries. Many queries, such as UNLOAD, will fail to parse in AWS Redshift. If parsing the variants is also required in your context, consider modifying the projects to accept the custom grammar OR use a platform like Apache Calcite.
There is a demand for SQL parsers to build reports on databases or data warehouse usage. There are several good open-source projects. However, there is a steep learning curve to use these projects, and a project may not always fit your specific requirements. Need help with parsing query history? Get in touch
The first step is to understand what data governance is. Data Governance is an overloaded term and means different things to different people. It has been helpful to define Data Governance based on the outcomes it is supposed to deliver.
Get in touch for bespoke support for PII Catcher
We can help discover, manage and secure sensitive data in your data warehouse.