Data Catalog

Data Catalog Options

PiiCatcher can write the output of a scan to the following destinations:

  • Terminal: Write an ASCII table to the terminal. This is useful for small tables when running in test mode.
  • File: Write the catalog to a file in JSON format.
  • Database: Store the metadata in a database catalog. Only MySQL is supported as a data catalog.
  • AWS Glue: Store the metadata as parameters in Table objects in AWS Glue. This option is supported for AWS Glue tables only.

Terminal

Terminal is the default option.

Note: This option should not be used in production or in scheduled runs.

Command line option: --catalog-format ascii_table

Example output:

╭─────────────┬─────────────┬─────────────┬─────────────╮
│ schema      │ table       │ column      │ has_pii     │
├─────────────┼─────────────┼─────────────┼─────────────┤
│ main        │ full_pii    │ a           │ 1           │
│ main        │ full_pii    │ b           │ 1           │
│ main        │ no_pii      │ a           │ 0           │
│ main        │ no_pii      │ b           │ 0           │
│ main        │ partial_pii │ a           │ 1           │
│ main        │ partial_pii │ b           │ 0           │
╰─────────────┴─────────────┴─────────────┴─────────────╯

File

Command line option: --catalog-format json --catalog-file <file name>

Example output:

[
  {
    "has_pii": false,
    "name": "testSchema",
    "tables": [
      {
        "columns": [
          {
            "name": "c1",
            "pii_types": []
          },
          {
            "name": "c2",
            "pii_types": [
              {
                "__enum__": "PiiTypes.LOCATION"
              }
            ]
          }
        ],
        "has_pii": false,
        "name": "t1"
      }
    ]
  }
]
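
The nested schema, table, and column layout of the JSON catalog can be walked with a few lines of Python. The following is a minimal sketch, not part of PiiCatcher: the helper name pii_columns is hypothetical, and the sample document is inlined from the example output so the snippet runs on its own.

```python
import json

# Sample catalog, copied from the example output above.
CATALOG_JSON = """
[
  {
    "has_pii": false,
    "name": "testSchema",
    "tables": [
      {
        "columns": [
          {"name": "c1", "pii_types": []},
          {"name": "c2", "pii_types": [{"__enum__": "PiiTypes.LOCATION"}]}
        ],
        "has_pii": false,
        "name": "t1"
      }
    ]
  }
]
"""


def pii_columns(catalog):
    """Yield (schema, table, column, pii type names) for columns that list any PII type.

    Hypothetical helper for illustration; it relies only on the keys
    visible in the example output.
    """
    for schema in catalog:
        for table in schema["tables"]:
            for column in table["columns"]:
                types = [t["__enum__"] for t in column["pii_types"]]
                if types:
                    yield schema["name"], table["name"], column["name"], types


catalog = json.loads(CATALOG_JSON)
rows = list(pii_columns(catalog))
for row in rows:
    print(row)
```

To read a catalog written with --catalog-file, replace json.loads(CATALOG_JSON) with json.load on the opened file.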

Database

PiiCatcher creates four tables in the data catalog:

  • DbSchemas
  • DbTables
  • DbColumns
  • DbFiles

DbSchemas, DbTables and DbColumns are used when a database is scanned.

DbFiles is used when files are scanned.

The schemas of these tables are:

DbSchemas

Column      Description
id          Integer. Auto Increment. Primary Key
name        Text. Name of the schema

DbTables

Column      Description
id          Integer. Auto Increment. Primary Key
name        Text. Name of the table
schema_id   Foreign Key to DbSchemas table

DbColumns

Column      Description
id          Integer. Auto Increment. Primary Key
name        Text. Name of the column
pii_type    Text. JSON-serialized array of PiiTypes
table_id    Foreign Key to DbTables table

DbFiles

Column      Description
id          Integer. Auto Increment. Primary Key
full_path   Text. Absolute path of the file
mime_type   Text. MIME type, as determined by the Python Magic module
pii_types   Text. JSON-serialized array of PiiTypes
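
The foreign keys chain DbColumns to DbTables to DbSchemas, so listing PII columns takes a join across the three tables. The sketch below illustrates that join under several assumptions: it uses an in-memory SQLite database as a stand-in (PiiCatcher itself supports only MySQL as a catalog), the column types are approximations of the descriptions above, and the inserted rows, including the serialized form of pii_type, are hypothetical sample data.

```python
import json
import sqlite3

# In-memory SQLite stand-in for the MySQL catalog (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DbSchemas (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT
);
CREATE TABLE DbTables (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    schema_id INTEGER REFERENCES DbSchemas(id)
);
CREATE TABLE DbColumns (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    pii_type TEXT,
    table_id INTEGER REFERENCES DbTables(id)
);
""")

# Hypothetical sample rows; the serialized pii_type format is assumed
# to match the JSON file output.
cur = conn.cursor()
cur.execute("INSERT INTO DbSchemas (name) VALUES (?)", ("main",))
schema_id = cur.lastrowid
cur.execute(
    "INSERT INTO DbTables (name, schema_id) VALUES (?, ?)",
    ("partial_pii", schema_id),
)
table_id = cur.lastrowid
cur.executemany(
    "INSERT INTO DbColumns (name, pii_type, table_id) VALUES (?, ?, ?)",
    [
        ("a", json.dumps([{"__enum__": "PiiTypes.LOCATION"}]), table_id),
        ("b", json.dumps([]), table_id),
    ],
)

# Follow the foreign keys from columns up to schemas.
query = """
SELECT s.name, t.name, c.name, c.pii_type
FROM DbColumns AS c
JOIN DbTables AS t ON c.table_id = t.id
JOIN DbSchemas AS s ON t.schema_id = s.id
ORDER BY s.name, t.name, c.name
"""
all_rows = cur.execute(query).fetchall()

# Deserialize pii_type and keep only columns with at least one PII type.
pii_rows = [
    (schema, table, column, json.loads(pii))
    for schema, table, column, pii in all_rows
    if json.loads(pii)
]
for row in pii_rows:
    print(row)
conn.close()
```

Deserializing pii_type in Python rather than comparing strings in SQL avoids depending on the exact text of the serialized array.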

AWS Glue

AWS Glue can be used only when tables in AWS Glue are scanned.

Command line option: --catalog-format glue

For example output, see the AWS Glue Analyzer blog post.