PIICatcher is an open-source project that automatically detects Personally Identifiable Information (PII) in its supported databases and stores this information in a data catalog. Used by data engineers to help discover sensitive data, it is compatible with popular databases such as Snowflake and AWS Redshift. For more information on existing PIICatcher features or additional support with this software, head to https://tokern.io/piicatcher/.
Release Notes: PIICatcher V0.21.0
SqlAlchemy-BigQuery and Google-Cloud-BigQuery-Storage have been added as project dependencies to support BigQuery in PIICatcher. To add BigQuery as a data source, users provide a memorable name for the data source, their client email as the username, the project ID, and the local path to their credentials file (JSON).
This detector is used during metadata scanning. To make address detection more sensitive, we separated the detection of zip codes and PO boxes from street addresses. The regex for US Social Security number detection has been updated for a more comprehensive scan. Phone and credit card PII types have also been added to this detector.
Previously, the datum regex detector was only able to detect the following PII types: phone, email, credit card, and address. After updating the regex library, we are now able to detect US Social Security numbers, zip codes, and PO boxes.
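As a rough illustration of value-level matching for the newly supported types, here is a minimal sketch using Python's standard `re` module. The patterns and the `detect_pii_types` helper are simplified assumptions written for this example; they are not the expressions used by PIICatcher or its regex library.

```python
import re

# Illustrative patterns only (assumptions for this sketch, not PIICatcher's own).
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "zip_code": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
    "po_box": re.compile(r"\bP\.?\s?O\.?\s?Box\s+\d+\b", re.IGNORECASE),
}


def detect_pii_types(value: str) -> set:
    """Return the names of all patterns that match somewhere in `value`."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(value)}


if __name__ == "__main__":
    for sample in ["123-45-6789", "PO Box 1234, Springfield 12345", "hello world"]:
        print(sample, "->", sorted(detect_pii_types(sample)))
```

Keeping zip code and PO box as separate patterns means each one can be reported as its own PII type rather than being folded into a single address match.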
The CommonRegex library was previously used in PIICatcher’s regex detector. However, that package has not been updated for quite a while, so to broaden PIICatcher's PII support we switched to the CommonRegex-Improved library.
We previously sampled tables with TABLESAMPLE and the BERNOULLI method, which Redshift does not support. By using RANDOM() instead, PIICatcher can now sample and scan tables on Redshift.
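The exact query PIICatcher generates is not shown in these notes. As a hedged sketch, one common way to sample rows with `RANDOM()` is to order by it and apply a `LIMIT`. The `build_sample_query` helper below is hypothetical and assumes the schema, table, and column names come from a trusted source such as the catalog, since it does not escape identifiers.

```python
def build_sample_query(schema: str, table: str, columns: list, sample_size: int) -> str:
    """Build a query returning up to `sample_size` rows in random order.

    Hypothetical helper for illustration; identifiers are only double-quoted,
    not escaped, so they must come from a trusted source.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    column_list = ", ".join(f'"{name}"' for name in columns)
    return (
        f'SELECT {column_list} FROM "{schema}"."{table}" '
        f"ORDER BY RANDOM() LIMIT {sample_size}"
    )


if __name__ == "__main__":
    print(build_sample_query("public", "users", ["email", "phone"], 10))
```

Ordering by `RANDOM()` has to assign a random value to every row before the limit is applied, so on very large tables a filter such as `WHERE RANDOM() < 0.01` may be a cheaper alternative when an approximate sample size is acceptable.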
The shallow scan now also runs during the deep scan to improve accuracy and the identification of PII types, as requested in GitHub issue 68.
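To show why combining the two scans can help, here is a minimal sketch in which a deep scan starts from the column-name (shallow) result and then adds whatever the sampled values reveal. The patterns, the function names, and the choice to take the union of both results are assumptions made for this example, not PIICatcher's actual implementation.

```python
import re

# Hypothetical patterns for this sketch only.
COLUMN_NAME_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
}
VALUE_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def shallow_scan(column_name: str) -> set:
    """Guess PII types from the column name alone (metadata only)."""
    return {t for t, p in COLUMN_NAME_PATTERNS.items() if p.search(column_name)}


def deep_scan(column_name: str, values: list) -> set:
    """Combine the shallow result with types found in sampled values."""
    found = shallow_scan(column_name)
    for value in values:
        found |= {t for t, p in VALUE_PATTERNS.items() if p.search(str(value))}
    return found


if __name__ == "__main__":
    print(sorted(deep_scan("user_email", ["n/a"])))
    print(sorted(deep_scan("contact", ["a@b.com"])))
```

In this sketch a column named `user_email` is still flagged when its sampled values happen to be placeholders, and a generically named column is flagged when its values look like email addresses.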
PIICatcher uses Dependabot to help keep its dependencies up to date. This release updates requests from version 2.28.1 to 2.31.0 and tornado from version 6.2 to 6.3.2.
Aside from the PIICatcher project, we have also added BigQuery support to DBCat (issue 186), which has been published on PyPI as version 0.14.0.
Athena support for PIICatcher
Improve MySQL query in DBCat for larger relational databases (issue 190)
We all hang out on Slack. Come as you are, say hi, ask questions, help friends, and honestly, geek out! Alternatively, you can post any of your questions on GitHub Discussions.
Thank you to everyone who reported the above issues and helped us mitigate them. Are you currently facing any problems with PIICatcher? Open an issue on our GitHub repository! We always welcome feedback and suggestions for future improvements.
For now, we hope these updates will continue to develop PIICatcher as a tool to manage and protect your data. See you soon!