AWS Lake Formation And Glue Access Analyzer

Rajat Venkatesh | 12/09/2019 | 2 Min Read | In AWS S3, AWS Glue, AWS Lake Formation, AWS Athena, Data Catalog

AWS Lake Formation permissions control access to data sets in your AWS data lake at table-level and column-level granularity. For a quick primer, read the Lake Permissions by Example blog post.

Once access policies are set up in AWS Lake Formation, it is important to check regularly that the policies are up to date and are not leaking any unintended privileges. In this article, two utilities - lakecli and piicatcher - are combined to automatically check for privilege leaks in a data lake built on Glue, Lake Formation and S3 with a single SQL statement.

piicatcher tags all columns that contain critical data like PII & PHI in the AWS Glue catalog.

Check out Tokern PIICatcher to scan datasets for PII and PHI.

lakecli provides a SQL interface to find all privileged users.

Once the list of privileged users has been reviewed, a scheduled automated process can use it to ensure that privileged access does not leak to anyone else.

[Image: "Antique key and lock", licensed under CC0 1.0]

Prerequisites

The article assumes the AWS account has a data lake set up using the following technologies:

  • AWS Glue
  • AWS Lake Formation
  • AWS Athena
  • AWS CloudTrail

AWS Athena is used by data analysts and scientists to access the data. If you use another product, ensure that it uses the Glue catalog as its metadata store.

See the Secure Data Lake Tutorial to set up a secure data lake using the New York City Taxi and Limousine Commission (TLC) Trip Record Data.

Discover and categorize data

The first step in analyzing access is to categorize data sets. Typically, access policies are defined per category. Every business has its own categories and its own patterns for recognizing them. Common categories of data are:

  • Personally Identifiable Information (PII)
  • Protected Health Information (PHI)
  • Business-specific critical information like sales and financial data
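As a rough illustration of what "patterns to recognize a category" can mean, the sketch below matches column names against regular expressions. The categories, patterns and the `categorize` helper are all hypothetical and are not taken from PiiCatcher. Matching on names alone is only a heuristic, and a real scanner would also inspect the data.

```python
import re

# Hypothetical name patterns per category; each business would tune its own.
CATEGORY_PATTERNS = {
    "PII": re.compile(r"(email|phone|ssn|address|birth)", re.IGNORECASE),
    "PHI": re.compile(r"(diagnosis|patient|prescription)", re.IGNORECASE),
    "FINANCIAL": re.compile(r"(salary|revenue|invoice)", re.IGNORECASE),
}


def categorize(column_name):
    """Return the first category whose pattern matches the column name, else None."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(column_name):
            return category
    return None


for name in ["customer_email", "patient_id", "monthly_revenue", "pickup_datetime"]:
    print(name, "->", categorize(name))
```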

PiiCatcher

Run PiiCatcher to discover PII data in the NYC Trip data set.

> piicatcher aws -r <region> --list-all

Schema    Table          Column                Has PII
taxidata  csv_misc       borough               1
taxidata  csv_misc       zone                  1
taxidata  csv_misc       service_zone          1
taxidata  csv_trip_data  dispatching_base_num  0
taxidata  csv_trip_data  pickup_datetime       0
taxidata  csv_trip_data  dropoff_datetime      0
taxidata  csv_trip_data  hvfhs_license_num     0

PiiCatcher finds PII data in taxidata.csv_misc.

Augment AWS Glue Catalog with categories

PiiCatcher finds PII data, but that alone is not sufficient. The tables and columns must be tagged with the category permanently so that other utilities can use the tags for analysis. The AWS Glue Catalog allows custom metadata to be stored in a field called Parameters for every column. For example, the columns for taxidata.csv_misc are:

'Columns': [
    {
        'Name': 'locationid',
        'Type': 'bigint'
    },
    {
        'Name': 'borough',
        'Type': 'string'
    },
    {
        'Name': 'zone',
        'Type': 'string'
    },
    {
        'Name': 'service_zone',
        'Type': 'string'
    }
]

PiiCatcher adds a parameter to store the type of PII data found in the column when run with the command below:

piicatcher aws -r <region> --catalog-format glue

After the run, piicatcher has added a new parameter whose key is PII and whose value is the category of PII found. The same table now has the following metadata:

'Columns': [
    {
        'Name': 'locationid',
        'Type': 'bigint'
    },
    {
        'Name': 'borough',
        'Type': 'string',
        'Parameters': {
            'PII': 'PiiTypes.ADDRESS'
        }
    },
    {
        'Name': 'zone',
        'Type': 'string',
        'Parameters': {
            'PII': 'PiiTypes.ADDRESS'
        }
    },
    {
        'Name': 'service_zone',
        'Type': 'string',
        'Parameters': {
            'PII': 'PiiTypes.ADDRESS'
        }
    }
]

The columns with the PII parameter can now be used by lakecli to analyze privileged access.
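To make the tagging concrete, here is a minimal sketch of how another utility might read the PII parameter back from the column metadata. The helper names `pii_columns` and `fetch_columns` are hypothetical. `fetch_columns` assumes boto3 is installed and AWS credentials are configured, and it is not called in the example, which runs against the metadata shown above.

```python
def pii_columns(columns):
    """Map column name -> PII category for columns carrying a 'PII' parameter."""
    return {
        col["Name"]: col["Parameters"]["PII"]
        for col in columns
        if "PII" in col.get("Parameters", {})
    }


def fetch_columns(database, table, region):
    """Fetch column metadata from the Glue catalog (assumes boto3 and AWS credentials)."""
    import boto3  # imported lazily so pii_columns works without boto3 installed

    glue = boto3.client("glue", region_name=region)
    response = glue.get_table(DatabaseName=database, Name=table)
    return response["Table"]["StorageDescriptor"]["Columns"]


# Column metadata for taxidata.csv_misc, copied from the listing above.
columns = [
    {"Name": "locationid", "Type": "bigint"},
    {"Name": "borough", "Type": "string", "Parameters": {"PII": "PiiTypes.ADDRESS"}},
    {"Name": "zone", "Type": "string", "Parameters": {"PII": "PiiTypes.ADDRESS"}},
    {"Name": "service_zone", "Type": "string", "Parameters": {"PII": "PiiTypes.ADDRESS"}},
]

tagged = pii_columns(columns)
print(tagged)
```

Against a live catalog, the same function could be applied to the result of `fetch_columns("taxidata", "csv_misc", "<region>")`.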

Access Table Properties in Information Schema

lakecli provides an information schema for AWS Lake Formation. The information schema provides a SQL interface to the Glue catalog and Lake Formation permissions for easy analysis.

The columns table has information on which columns have PII data.

\r:iamdb> SELECT ORDINAL, TABLE_NAME, COLUMN_NAME, PII FROM COLUMNS;

ordinal  table_name     column_name           pii
1        csv_misc       locationid            null
2        csv_misc       borough               PiiTypes.ADDRESS
3        csv_misc       zone                  PiiTypes.ADDRESS
4        csv_misc       service_zone          PiiTypes.ADDRESS
1        csv_trip_data  dispatching_base_num  null
2        csv_trip_data  pickup_datetime       null
3        csv_trip_data  dropoff_datetime      null
4        csv_trip_data  pulocationid          null
5        csv_trip_data  dolocationid          null
6        csv_trip_data  sr_flag               null
7        csv_trip_data  hvfhs_license_num     null

table_privileges stores all the privileges defined on tables.

\r:iamdb> SELECT * FROM TABLE_PRIVILEGES where principal like 'user%';

id  schema_name  table_name  principal           permission
1   taxidata     csv_misc    user/datalake_engg  ALTER
2   taxidata     csv_misc    user/datalake_engg  DELETE
3   taxidata     csv_misc    user/datalake_engg  INSERT
4   taxidata     csv_misc    user/datalake_engg  SELECT
11  taxidata     csv_misc    user/lakeadmin      ALL

The query below joins these tables and lists the principals who have access to columns with PII data.

SELECT DISTINCT `PRINCIPAL` FROM `COLUMNS` INNER JOIN `TABLE_PRIVILEGES` ON
  `COLUMNS`.`TABLE_SCHEMA` = `TABLE_PRIVILEGES`.`SCHEMA_NAME` AND
  `COLUMNS`.`TABLE_NAME` = `TABLE_PRIVILEGES`.`TABLE_NAME`
WHERE `COLUMNS`.`PII` IS NOT NULL AND
  `TABLE_PRIVILEGES`.`PERMISSION` IN ('ALL', 'SELECT')
ORDER BY 1

principal
IAM_ALLOWED_PRINCIPALS
role/LakeFormationWorkflowRole
user/datalake_engg
user/lakeadmin

This list can be reviewed to ensure that the right principals have access to PII data. A similar process of discovery and analysis can be extended to all types of critical data.
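The join can be tried without an AWS account. The sketch below loads a few of the sample rows shown above into an in-memory SQLite database and runs an equivalent query. SQLite is only a stand-in here for lakecli's information schema, and the `user/analyst` row is a hypothetical addition to show that access to a table without PII columns is excluded. Only the `user%` privileges listed earlier are loaded, so the result is shorter than the full list above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE columns (table_schema TEXT, table_name TEXT, column_name TEXT, pii TEXT);
CREATE TABLE table_privileges (
    id INTEGER, schema_name TEXT, table_name TEXT, principal TEXT, permission TEXT
);
""")

# A subset of the column rows shown earlier; None stands for null.
conn.executemany("INSERT INTO columns VALUES (?, ?, ?, ?)", [
    ("taxidata", "csv_misc", "locationid", None),
    ("taxidata", "csv_misc", "borough", "PiiTypes.ADDRESS"),
    ("taxidata", "csv_misc", "zone", "PiiTypes.ADDRESS"),
    ("taxidata", "csv_trip_data", "pickup_datetime", None),
])

conn.executemany("INSERT INTO table_privileges VALUES (?, ?, ?, ?, ?)", [
    (1, "taxidata", "csv_misc", "user/datalake_engg", "ALTER"),
    (4, "taxidata", "csv_misc", "user/datalake_engg", "SELECT"),
    (11, "taxidata", "csv_misc", "user/lakeadmin", "ALL"),
    # Hypothetical principal with access only to a table that has no PII columns.
    (99, "taxidata", "csv_trip_data", "user/analyst", "SELECT"),
])

query = """
SELECT DISTINCT principal
FROM columns
INNER JOIN table_privileges
   ON columns.table_schema = table_privileges.schema_name
  AND columns.table_name = table_privileges.table_name
WHERE columns.pii IS NOT NULL
  AND table_privileges.permission IN ('ALL', 'SELECT')
ORDER BY 1
"""
principals = [row[0] for row in conn.execute(query)]
print(principals)
```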

Continuous Access Analysis

The above process can be scheduled with systems like cron or Apache Airflow so that the privileged principals are compared against a canonical list on every run.
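The comparison step might look like the sketch below. The function names and the approved list are hypothetical. The current list is the query result shown earlier. A scheduled job would obtain it from the query and pass the returned code to `sys.exit` so that the scheduler can flag a failure.

```python
def unexpected_principals(current, approved):
    """Return principals that have PII access but are not on the approved list."""
    return sorted(set(current) - set(approved))


def exit_code(current, approved):
    """Return 1 if any unapproved principal has PII access, else 0."""
    leaks = unexpected_principals(current, approved)
    if leaks:
        print("Unapproved principals with PII access:", ", ".join(leaks))
        return 1
    return 0


# Result of the join query shown earlier in the article.
current = [
    "IAM_ALLOWED_PRINCIPALS",
    "role/LakeFormationWorkflowRole",
    "user/datalake_engg",
    "user/lakeadmin",
]

# Hypothetical canonical list produced by a manual review.
approved = {"role/LakeFormationWorkflowRole", "user/lakeadmin"}

status = exit_code(current, approved)
```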

Conclusion

This article described how two utilities - lakecli and piicatcher - can be combined to automatically check for privilege leaks in an AWS data lake built on AWS Glue, Lake Formation and S3. If Access Analyzer is of interest to you, get in touch through the chat widget.