PIICatcher Update Release: v0.21.0

Nicole | 6/23/2023 | 3 Min Read

What is PIICatcher?

PIICatcher is an open-source project that automatically detects Personally Identifiable Information (PII) in its supported databases and stores this information in a data catalog. Used by data engineers to help discover sensitive data, it is compatible with popular databases such as Snowflake and AWS Redshift. For more information on existing PIICatcher features or additional support with this software, head to https://tokern.io/piicatcher/.


This release brings new features, key bug fixes, and improvements to PIICatcher.

Release Notes: PIICatcher v0.21.0


New Features


  1. BigQuery

The sqlalchemy-bigquery and google-cloud-bigquery-storage packages have been added as project dependencies to support BigQuery in PIICatcher. To add BigQuery as a data source, users provide a memorable name for the data source, their client email as the username, the project ID, and the local path to their JSON credentials file.
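As a rough illustration of the four values listed above, the sketch below groups them in a small Python dataclass. The class name `BigQuerySource`, its field names, and the example values are hypothetical and are not PIICatcher's own API; only the `bigquery://<project>` URL form comes from sqlalchemy-bigquery.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class BigQuerySource:
    """Hypothetical container for the values needed to add BigQuery."""

    name: str         # memorable name for the data source
    username: str     # client email, used as the username
    project_id: str   # Google Cloud project to scan
    key_path: Path    # local path to the JSON credentials file

    def sqlalchemy_url(self) -> str:
        # sqlalchemy-bigquery connection URLs take the form bigquery://<project>
        return f"bigquery://{self.project_id}"


# Example values are placeholders, not real credentials.
source = BigQuerySource(
    name="bq_prod",
    username="scanner@example-project.iam.gserviceaccount.com",
    project_id="example-project",
    key_path=Path("/path/to/credentials.json"),
)
print(source.sqlalchemy_url())
```

With sqlalchemy-bigquery installed, the URL and key path could then be handed to `sqlalchemy.create_engine(url, credentials_path=str(source.key_path))`.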


  2. Column Name Regex Detector

This detector is used during metadata scanning. To make address detection more sensitive, Zip Code and PO Box are now detected separately from street addresses. The regex for US Social Security Numbers has been updated for a more comprehensive scan. Phone and Credit Card PII types have also been added to this detector.
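To make the idea concrete, here is a minimal sketch of column-name matching with Python's `re` module. The patterns, the type labels, and the function `detect_column_type` are illustrative assumptions and are not the expressions PIICatcher ships with.

```python
import re
from typing import Optional

# Illustrative patterns only; PIICatcher's real expressions may differ.
# Zip code and PO box come before the general address pattern so that
# they are reported as their own types.
COLUMN_NAME_PATTERNS = {
    "ZIP_CODE": re.compile(r"zip|postal", re.IGNORECASE),
    "PO_BOX": re.compile(r"po_?box", re.IGNORECASE),
    "ADDRESS": re.compile(r"address|street", re.IGNORECASE),
    "SSN": re.compile(r"ssn|social_?sec", re.IGNORECASE),
    "PHONE": re.compile(r"phone|mobile", re.IGNORECASE),
    "CREDIT_CARD": re.compile(r"credit_?card", re.IGNORECASE),
}


def detect_column_type(column_name: str) -> Optional[str]:
    """Return the first PII type whose pattern matches the column name."""
    for pii_type, pattern in COLUMN_NAME_PATTERNS.items():
        if pattern.search(column_name):
            return pii_type
    return None


for column in ["zip_code", "PO_Box", "street_address", "order_total"]:
    print(column, "->", detect_column_type(column))
```

Because only column names are inspected, a check like this needs no access to table data, which is why it fits a metadata scan.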


  3. Datum Regex Detector

Previously, the datum regex detector could only detect the following PII types: phone, email, credit card, and address. With the updated regex library, it can now also detect US Social Security numbers, zip codes, and PO boxes.
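The sketch below shows what value-level detection of the three new types might look like. The patterns are simplified stand-ins written for this example, not the ones in the CommonRegex-Improved library, and `detect_datum_types` is a hypothetical helper.

```python
import re
from typing import Set

# Simplified, illustrative patterns; the library's own are more thorough.
DATUM_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ZIP_CODE": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
    "PO_BOX": re.compile(r"\bP\.?\s?O\.?\s?Box\s+\d+\b", re.IGNORECASE),
}


def detect_datum_types(value: str) -> Set[str]:
    """Return every PII type whose pattern is found in the value."""
    return {
        pii_type
        for pii_type, pattern in DATUM_PATTERNS.items()
        if pattern.search(value)
    }


for value in ["123-45-6789", "94105-1234", "P.O. Box 123", "hello world"]:
    print(repr(value), "->", sorted(detect_datum_types(value)))
```

Returning a set rather than a single type allows one value to match more than one pattern, for instance a PO box number that is five digits long.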


Bug Fixes and Improvements to Existing Features

  1. Regex Library

PIICatcher's regex detector previously used the CommonRegex library. However, that package has not been updated for quite a while, so to broaden PII support in PIICatcher, we switched to the CommonRegex-Improved library.


  2. Redshift scanning

We previously sampled Redshift tables using TABLESAMPLE with the BERNOULLI method, which Redshift does not support. By using RANDOM() instead, PIICatcher can now sample and scan tables on Redshift.
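A minimal sketch of RANDOM()-based sampling is shown below. The exact query PIICatcher issues is not given in these notes, so the `ORDER BY RANDOM() LIMIT n` form and the helpers `quote_ident` and `build_sample_query` are assumptions made for illustration.

```python
from typing import Sequence


def quote_ident(identifier: str) -> str:
    """Double-quote an identifier, escaping embedded double quotes."""
    return '"' + identifier.replace('"', '""') + '"'


def build_sample_query(
    schema: str, table: str, columns: Sequence[str], limit: int = 10
) -> str:
    """Build a query that returns `limit` randomly chosen rows."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    column_list = ", ".join(quote_ident(c) for c in columns)
    return (
        f"SELECT {column_list} "
        f"FROM {quote_ident(schema)}.{quote_ident(table)} "
        f"ORDER BY RANDOM() LIMIT {int(limit)}"
    )


print(build_sample_query("public", "users", ["email", "phone"], limit=5))
```

Ordering by RANDOM() sorts the whole table before the limit is applied, so on very large tables a filter such as `WHERE RANDOM() < 0.01` may be a cheaper way to draw a sample.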


  3. Deep Scan

The deep scan now also runs the shallow scan, which improves accuracy and the identification of PII types, as requested in GitHub issue 68.
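One plausible way to combine the two scans is sketched below: check the column name first, then fall back to sampled values. How PIICatcher actually merges the results is not described here, so the function `deep_scan_column` and both stand-in detectors are hypothetical.

```python
import re
from typing import Callable, Iterable, Optional

Detector = Callable[[str], Optional[str]]


def name_detector(column_name: str) -> Optional[str]:
    # Stand-in for a shallow (metadata) detector.
    return "EMAIL" if "email" in column_name.lower() else None


def datum_detector(value: str) -> Optional[str]:
    # Stand-in for a deep (data) detector.
    return "SSN" if re.fullmatch(r"\d{3}-\d{2}-\d{4}", value) else None


def deep_scan_column(
    column_name: str,
    sample_values: Iterable[str],
    by_name: Detector = name_detector,
    by_value: Detector = datum_detector,
) -> Optional[str]:
    """Run the shallow check first, then inspect sampled values."""
    pii_type = by_name(column_name)
    if pii_type is not None:
        return pii_type
    for value in sample_values:
        pii_type = by_value(value)
        if pii_type is not None:
            return pii_type
    return None


print(deep_scan_column("contact_email", ["n/a"]))
print(deep_scan_column("col_7", ["abc", "123-45-6789"]))
```

In this arrangement, a column with a telling name is classified even when its sampled rows happen to be empty or uninformative.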


  4. Dependabot

PIICatcher now uses Dependabot to help keep its dependencies up to date. This release updates requests from version 2.28.1 to 2.31.0 and tornado from version 6.2 to 6.3.2.


Other Updates

Aside from the PIICatcher project, we have also added BigQuery support to DBCat (issue 186), which has been published on PyPI as version 0.14.0.


Here are the features you can expect in the next release:

  1. Athena support for PIICatcher

  2. Improve MySQL query in DBCat for larger relational databases (issue 190)


Community

We all hang out on Slack. Come as you are, say hi, ask questions, help friends, and honestly, geek out! Alternatively, you can post any of your questions on GitHub Discussions.

Thank you to everyone who reported the above issues and helped us fix them. Are you currently facing any problems with PIICatcher? Open an issue on our GitHub repository! We always welcome feedback and suggestions for future improvements.

For now, we hope these updates make PIICatcher a more useful tool for managing and protecting your data. See you soon!


Get in touch for bespoke support for PIICatcher

We can help you discover, manage, and secure sensitive data in your data warehouse.