Skip to main content

PII Detectors

PIICatcher uses two main strategies to detect PII data:

  1. Metadata: Scan column and table names
  2. Data: Scan data stored in columns

Metadata and data can be scanned using multiple techniques such as regular expressions or NLP processors. PIICatcher supports plugins to create new detectors that process metadata or data.

PIICatcher has two builtin detectors:

  1. ColumnNameRegexDetector: Uses regular expressions to detect column names that may contain PII.
  2. DatumRegexDetector: Uses regular expressions to detect PII in sample data in columns.

piicatcher_spacy uses Spacy to detect PII in column data.

Scan Data Warehouse Metadata

Data engineers use descriptive names for tables and columns to help users understand the data stored in them. Therefore, the names of tables and columns provide clues to the type of data stored. For example,

  • first_name, last_name, full_name or name maybe used to store the name of a person.
  • ssn or social_security maybe used to store US SSN numbers.
  • phone or phone_number maybe used to store phone numbers.

All data warehouses provide an information schema to extract schema, table and column information. For example, the following query can be used to get metadata from Snowflake:

SELECT
lower(c.column_name) AS col_name,
c.comment AS col_description,
lower(c.data_type) AS col_type,
lower(c.ordinal_position) AS col_sort_order,
lower(c.table_catalog) AS database,
lower({cluster_source}) AS cluster,
lower(c.table_schema) AS schema,
lower(c.table_name) AS name,
t.comment AS description,
decode(lower(t.table_type), 'view', 'true', 'false') AS is_view
FROM
{database}.{schema}.COLUMNS AS c
LEFT JOIN
{database}.{schema}.TABLES t
ON c.TABLE_NAME = t.TABLE_NAME
AND c.TABLE_SCHEMA = t.TABLE_SCHEMA

Regular expressions can be used to match table or column names. For example, the regular expression below detects a column that stores social security numbers:

^.*(ssn|social).*$

Scan data stored in columns

The second strategy is to scan the data stored in columns. Within this strategy the two sub-strategies are:

The major disadvantage of this strategy is that NLP libraries are compute intensive. It can be prohibitively expensive to run NLP scanners even on moderately sized tables let alone tables of millions or billions of rows. Therefore, a random sample of rows should be scanned. Choosing a random sample is harder than expected. Luckily, a few databases provide builtin functions to choose a random sample. For example, the Snowflake query below provides a random sample:

select {column_list} from {schema_name}.{table_name} TABLESAMPLE BERNOULLI (10 ROWS)

Once the rows have been extracted, then can be processed using regular expressions or NLP libraries to detect PII content.