Analyze Access Permissions for AWS Glue and Lake Formation

Rajat Venkatesh, 12/9/2019, 2 Min Read

AWS Lake Formation permissions control access to data sets in an AWS data lake at table-level and column-level granularity. For a quick primer, read the Lake Permissions by Example blog post. After setting up access policies in AWS Lake Formation, it is necessary to regularly check that the policies are up to date and do not leak unintended privileges. In this article, two utilities, lakecli and piicatcher, are combined to automatically check for privilege leaks in a data lake built on Glue, Lake Formation, and S3 with a single SQL statement. piicatcher tags columns that contain critical data such as PII and PHI in the AWS Glue catalog.



lakecli provides a SQL interface to find all privileged users. After an initial review, a scheduled automated process can scan the list of users to ensure that privileged access has not leaked.



Prerequisites

The article assumes the AWS account has a data lake set up using the following technologies:

  • AWS Glue
  • AWS Lake Formation
  • AWS Athena
  • AWS CloudTrail

AWS Athena is used by data analysts and scientists to access the data. If you use another product, ensure that it uses the Glue catalog as its metadata store.


See the Secure Data Lake Tutorial to set up a secure data lake using New York City Taxi and Limousine Commission (TLC) Trip Record Data.

Discover and categorize data

The first step in analyzing access is to categorize data sets. Typically, access policies are determined per category. Every business has its own categories and its own patterns for recognizing them. Common categories of data are:

  • Personally Identifiable Information (PII)
  • Protected Health Information (PHI)
  • Business-specific critical information such as sales and financial data


PIICatcher

Run PiiCatcher to discover PII data in the NYC Trip data set.

> piicatcher aws -r <region> --list-all


PIICatcher finds PII data in taxidata.csv_misc.


Augment AWS Glue Catalog with categories

PiiCatcher finds PII data, but that is not sufficient. Permanently tagging the tables and columns with the category allows other utilities to use them for analysis in the future. AWS Glue Catalog allows custom metadata storage in a field called Parameters for every column. For example, the columns for taxidata.csv_misc are:

'Columns': [
    {
        'Name': 'locationid',
        'Type': 'bigint'
    },
    {
        'Name': 'borough',
        'Type': 'string'
    },
    {
        'Name': 'zone',
        'Type': 'string'
    },
    {
        'Name': 'service_zone',
        'Type': 'string'
    }
]
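To make the role of the Parameters field concrete, here is a minimal Python sketch that attaches a custom parameter to one entry of a column list shaped like the one above. The helper name tag_column is hypothetical and this is not PiiCatcher's implementation; it only illustrates the data structure. With boto3, a list of this shape is found in the Glue get_table response under Table, StorageDescriptor, Columns.

```python
import copy


def tag_column(columns, column_name, pii_type):
    """Return a copy of `columns` with a 'PII' parameter set on one column.

    `columns` is a list of dicts shaped like Glue column definitions.
    The input list is left unmodified.
    """
    tagged = copy.deepcopy(columns)
    for column in tagged:
        if column["Name"] == column_name:
            # Create the Parameters dict if the column has none yet.
            column.setdefault("Parameters", {})["PII"] = pii_type
            return tagged
    raise KeyError(f"no column named {column_name!r}")


# Two of the columns from the listing above.
columns = [
    {"Name": "locationid", "Type": "bigint"},
    {"Name": "borough", "Type": "string"},
]

tagged = tag_column(columns, "borough", "PiiTypes.ADDRESS")
```

Returning a copy rather than mutating the input keeps the original catalog metadata available for comparison before anything is written back.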


PiiCatcher adds a parameter to store the type of PII data found in the column when run with the command below:


piicatcher aws -r <region> --catalog-format glue


After the run, piicatcher has added a new parameter with the key PII, whose value is the category of PII found in the column. The same table now has the following metadata:


'Columns': [
    {
        'Name': 'locationid',
        'Type': 'bigint'
    },
    {
        'Name': 'borough',
        'Type': 'string',
        'Parameters': {
            'PII': 'PiiTypes.ADDRESS'
        }
    },
    {
        'Name': 'zone',
        'Type': 'string',
        'Parameters': {
            'PII': 'PiiTypes.ADDRESS'
        }
    },
    {
        'Name': 'service_zone',
        'Type': 'string',
        'Parameters': {
            'PII': 'PiiTypes.ADDRESS'
        }
    }
]


The columns with PII parameters can now be used by lakecli to analyze privilege access.
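As an illustration of how another utility might consume these tags, the following Python sketch collects the columns that carry a PII parameter from the metadata shown above. The function name pii_columns is hypothetical; lakecli's own mechanism is not shown in this article and may differ.

```python
def pii_columns(columns):
    """Map column name to PII category for columns tagged with a 'PII' parameter."""
    return {
        column["Name"]: column["Parameters"]["PII"]
        for column in columns
        if "PII" in column.get("Parameters", {})
    }


# Column metadata for taxidata.csv_misc after tagging, as listed above.
columns = [
    {"Name": "locationid", "Type": "bigint"},
    {"Name": "borough", "Type": "string",
     "Parameters": {"PII": "PiiTypes.ADDRESS"}},
    {"Name": "zone", "Type": "string",
     "Parameters": {"PII": "PiiTypes.ADDRESS"}},
    {"Name": "service_zone", "Type": "string",
     "Parameters": {"PII": "PiiTypes.ADDRESS"}},
]

found = pii_columns(columns)
for name, category in found.items():
    print(f"{name}: {category}")
```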

Access Table Properties in Information Schema

lakecli provides an information schema for AWS Lake Formation. The information schema provides a SQL interface to the Glue catalog and Lake Formation permissions for easy analysis.

The COLUMNS table records which columns contain PII data.


\r:iamdb> SELECT ORDINAL, TABLE_NAME, COLUMN_NAME, PII FROM COLUMNS;


The TABLE_PRIVILEGES table stores all the privileges defined on tables.

\r:iamdb> SELECT * FROM TABLE_PRIVILEGES where principal like 'user%';




The query below joins these tables and lists the principals who have access to columns with PII data:

SELECT DISTINCT `PRINCIPAL` FROM `COLUMNS` INNER JOIN `TABLE_PRIVILEGES` ON
  `COLUMNS`.`TABLE_SCHEMA` = `TABLE_PRIVILEGES`.`SCHEMA_NAME` AND
  `COLUMNS`.`TABLE_NAME` = `TABLE_PRIVILEGES`.`TABLE_NAME`
WHERE `COLUMNS`.`PII` IS NOT NULL AND
  `TABLE_PRIVILEGES`.`PERMISSION` IN ('ALL', 'SELECT')
ORDER BY 1
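To see what the join returns, the query can be tried against mock data. The Python sketch below uses an in-memory SQLite database as a stand-in for lakecli's information schema. The table layouts are assumptions that include only the columns the queries in this article reference, and the principals and the csv_trips table are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed, simplified layouts: only columns referenced by the article's queries.
conn.executescript("""
CREATE TABLE COLUMNS (
    TABLE_SCHEMA TEXT, TABLE_NAME TEXT, COLUMN_NAME TEXT,
    ORDINAL INTEGER, PII TEXT
);
CREATE TABLE TABLE_PRIVILEGES (
    SCHEMA_NAME TEXT, TABLE_NAME TEXT, PRINCIPAL TEXT, PERMISSION TEXT
);
""")

conn.executemany("INSERT INTO COLUMNS VALUES (?, ?, ?, ?, ?)", [
    ("taxidata", "csv_misc", "locationid", 0, None),
    ("taxidata", "csv_misc", "borough", 1, "PiiTypes.ADDRESS"),
    ("taxidata", "csv_trips", "fare", 0, None),  # hypothetical table, no PII
])

# Hypothetical principals, for illustration only.
conn.executemany("INSERT INTO TABLE_PRIVILEGES VALUES (?, ?, ?, ?)", [
    ("taxidata", "csv_misc", "user/analyst", "SELECT"),
    ("taxidata", "csv_misc", "user/etl", "INSERT"),
    ("taxidata", "csv_misc", "role/admin", "ALL"),
    ("taxidata", "csv_trips", "user/intern", "SELECT"),
])

# The query from the article, unchanged.
QUERY = """
SELECT DISTINCT `PRINCIPAL` FROM `COLUMNS` INNER JOIN `TABLE_PRIVILEGES` ON
  `COLUMNS`.`TABLE_SCHEMA` = `TABLE_PRIVILEGES`.`SCHEMA_NAME` AND
  `COLUMNS`.`TABLE_NAME` = `TABLE_PRIVILEGES`.`TABLE_NAME`
WHERE `COLUMNS`.`PII` IS NOT NULL AND
  `TABLE_PRIVILEGES`.`PERMISSION` IN ('ALL', 'SELECT')
ORDER BY 1
"""

principals = [row[0] for row in conn.execute(QUERY)]
print(principals)
```

In this mock data, user/etl is excluded because it holds only INSERT, and user/intern is excluded because csv_trips has no PII columns.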



Reviewing this list confirms that only the appropriate principals have access to PII data. The same discovery and analysis methods apply to other types of critical data.


Continuous Access Analysis

The process above can be automated with a scheduler such as cron or Apache Airflow: run the query on a schedule and compare the privileged principals it returns against a canonical list of approved principals.
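The comparison step might look like the following Python sketch. The function name and the sample principals are hypothetical; in practice the current list would come from the lakecli query and the approved list from wherever the reviewed, canonical list is stored.

```python
def unexpected_principals(current, approved):
    """Return principals that have privileged access but are not approved."""
    return sorted(set(current) - set(approved))


# Hypothetical inputs: the query result and the reviewed canonical list.
current = ["role/admin", "user/analyst", "user/contractor"]
approved = ["role/admin", "user/analyst"]

leaks = unexpected_principals(current, approved)
if leaks:
    print("Unapproved principals with access to PII:", ", ".join(leaks))
else:
    print("No privilege leaks found.")
```

A scheduled job could exit with a non-zero status or send an alert whenever the returned list is non-empty.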

Conclusion

This article described how two utilities, lakecli and piicatcher, can be combined to automatically check for privilege leaks in an AWS data lake built on AWS Glue, Lake Formation, and S3.

