Scan AWS S3 using Athena and Glue

Command Options

Option          Default  Description
access-key      None     AWS Access Key [required]
secret-key      None     AWS Secret Key [required]
staging-dir     None     S3 Staging Directory for Athena results [required]
region          None     AWS Region [required]
scan-type       shallow  One of deep, shallow. A deep scan checks sample data. A shallow scan checks column names using regular expressions.
list-all        False    List all columns. By default, only columns with PII information are listed.
schema          None     Scan only schemas matching the pattern. Refer to Include/Exclude Lists.
exclude-schema  None     Do not scan any schemas matching the pattern. Refer to Include/Exclude Lists.
table           None     Scan only tables matching the pattern. Refer to Include/Exclude Lists.
exclude-table   None     Do not scan any tables matching the pattern. Refer to Include/Exclude Lists.
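
The shallow scan type matches column names against regular expressions rather than reading data. The Python sketch below illustrates that idea only; the patterns and the function name `shallow_scan` are hypothetical and are not piicatcher's actual rules or API.

```python
import re

# Hypothetical patterns for illustration only; piicatcher's real
# expressions are not shown in this document and may differ.
PII_NAME_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "ssn": re.compile(r"ssn|social[-_]?security", re.IGNORECASE),
}


def shallow_scan(column_names):
    """Map each column whose name matches a pattern to its matching PII labels."""
    findings = {}
    for name in column_names:
        labels = [label for label, pattern in PII_NAME_PATTERNS.items()
                  if pattern.search(name)]
        if labels:
            findings[name] = labels
    return findings


columns = ["id", "user_email", "phone_number", "created_at"]
print(shallow_scan(columns))
```

Because only names are inspected, a column such as `notes` that happens to hold email addresses would go undetected; that is the case a deep scan, which checks sample data, is meant to cover.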

Command Line

piicatcher aws --help
Usage: piicatcher aws [OPTIONS]

Options:
  -a, --access-key TEXT           AWS Access Key   [required]
  -s, --secret-key TEXT           AWS Secret Key  [required]
  -d, --staging-dir TEXT          S3 Staging Directory for Athena results
                                  [required]
  -r, --region TEXT               AWS Region  [required]
  -c, --scan-type [deep|shallow]  Choose deep(scan data) or shallow(scan
                                  column names only)
  --list-all                      List all columns. By default only columns
                                  with PII information is listed
  -n, --schema TEXT               Scan only schemas matching schema.
  -N, --exclude-schema TEXT       Do not scan any schemas matching the schema
                                  pattern.
  -t, --table TEXT                Dump only tables matching table.
  -T, --exclude-table TEXT        Do not dump any tables matching the table
                                  pattern.
  --help                          Show this message and exit.
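
When the scan is launched from a script, the flags listed in the help text can be assembled programmatically. The following is a minimal sketch, not part of piicatcher: the helper `build_aws_scan_command` is hypothetical, all values are placeholders, and it assumes that `--schema` and `--table` may be repeated to pass several patterns.

```python
import shlex


def build_aws_scan_command(access_key, secret_key, staging_dir, region,
                           scan_type="shallow", list_all=False,
                           schemas=(), tables=()):
    """Assemble a `piicatcher aws` argument list from the options in the help text."""
    cmd = [
        "piicatcher", "aws",
        "--access-key", access_key,
        "--secret-key", secret_key,
        "--staging-dir", staging_dir,
        "--region", region,
        "--scan-type", scan_type,
    ]
    if list_all:
        cmd.append("--list-all")
    # Assumption: --schema and --table may be repeated, one pattern per flag.
    for schema in schemas:
        cmd += ["--schema", schema]
    for table in tables:
        cmd += ["--table", table]
    return cmd


# Placeholder values; substitute real credentials and an S3 path you own.
cmd = build_aws_scan_command(
    access_key="AKIAEXAMPLE",
    secret_key="example-secret",
    staging_dir="s3://example-bucket/athena-results/",
    region="us-east-1",
    scan_type="deep",
    schemas=["sales"],
)
print(shlex.join(cmd))
# To execute the scan: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the secret key; `shlex.join` is used here only to display the command.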

Configuration File

[aws]
access_key="..."
secret_key="..."
staging_dir="..."
region="..."
scan_type="[deep|shallow]"
list_all=True|False
schema=("<schema>",["<schema2>", ...])
exclude_schema=("<schema>",["<schema2>", ...])
table=("<table>",["<table2>", ...])
exclude_table=("<table>",["<table2>", ...])