AWS Lake Formation Permissions with Examples

Rajat Venkatesh — 11/26/20192 Min Read — In AWS S3, AWS Glue, AWS Lake Formation, AWS Athena

AWS Lake Formations helps to setup a secure data lake on AWS S3. One of the main goals of the product is Simplified Security Management. The central tenet to this goal is to define security, governance and audit policies in a single location. Lake Formation permissions go beyond IAM policies to manage secure access to databases, table and columns as well.

The previous blog post explained how to setup a data lake with NYC Trip Data.

fort

"fort" is licensed under CC0 1.0

This tutorial uses the same data lake and shows how to setup basic permissions for two personas: data engineer and data analyst.

Rest of the blog explains resources and permissions. Then it sets up two users and their permissions (data engineer and data analyst) to explain the concepts and how they help to secure the data lake.

Query and Analyze Permissions in AWS LakeFormation

Resources and Permissions

There are two types of resources in AWS Lake Formation:

  1. Metadata is stored in AWS Glue Data Catalog and consist of databases, tables and columns.
  2. Data is stored in AWS S3 locations or in databases.

Metadata

The table below describes the metadata permissions available for each of the resources in the catalog.

CatalogDatabaseTableColumn
CREATE_DATABASECREATE_TABLEALTERSELECT
ALTERDROP
DROPSELECT
INSERT
DELETE

Data

Data Location permissions are applicable to AWS S3 locations registered with Lake Formation. The only permission available as of this article is DATA_LOCATION_ACCESS. DATA_LOCATION_ACCESS allows users to create tables without the need to have IAM permissions for a S3 location.

Similarly, SELECT, INSERT and DELETE permissions are available only on registered S3 locations.

An important point is that Lake Formation users do not need IAM permissions to access tables in a registered data location. Lake Formation's permissions ensure secure access instead of AWS IAM permissions.

Note that for database locations such as RDS or Redshift the permissions are determined by the user specified in the JDBC connection string and the GRANT permissions in the database.

Permissions on NYC Trip Data data lake

Pre-requisites

  • AWS Glue, AWS Athena and AWS Lake Formation have been setup.
  • NYC Taxi Trip data set is available in the data lake in taxidata database.
  • lakeadmin is the AWS Lake Formation administrator.
  • lakecli and athenacli are setup and configured in your system.

Create Policies and Users

First create a few managed policies.

Navigate to AWS Console > IAM > Policies > Create Policy and create the following policies

  1. DataEngineerLakeFormationPolicy
{
    "Version": "2012-10-17",
    "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                        "lakeformation:GetDataAccess",
                        "lakeformation:GrantPermissions",
                        "lakeformation:RevokePermissions",
                        "lakeformation:BatchGrantPermissions",
                        "lakeformation:BatchRevokePermissions",
                        "lakeformation:ListPermissions" ],
                "Resource": "*"
            }
    ]
}
  1. DataEngineerPassRole
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PassRolePermissions",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::account-id:role/workflow_role"
            ]
        }
    ]
}
  1. DataAnalystGlueAccess
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess",
                "glue:GetTable",
                "glue:GetTables",
                "glue:SearchTables",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetPartitions"
            ],
            "Resource": "*"
        }
    ]
}
  1. S3NycBucketRead is required because nyc-tlc is not registered as a data location and explicit IAM permissions are required.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::nyc-tlc",
                "arn:aws:s3:::nyc-tlc/*"
            ]
        }
    ]
}

Navigate to AWS Console > IAM > Create User and create the following users with attached policies.

UserAttach Policies
datalake_engg
  • AmazonAthenaFullAccess
  • AWSGlueConsoleFullAccess
  • DataEngineerLakeFormationPolicy
  • DataEngineerPassRole
  • S3NycBucketRead
datalake_user
  • AWSAthenaFullAccess
  • DataAnalystGlueAccess
  • S3NycBucketRead

Register a data location

  1. Create an S3 bucket called prefix_datalake. Choose a unique prefix.
  2. In AWS Console, go to AWS Lake Formation > Databases > Create Database
  3. Create a database with the following details:
FieldValue
Nametaxilake
Locationprefix_datalake

Note that no explicit IAM permissions are setup for the users.

Data Engineer Setup

This section sets up the permission for datalake_engg. datalake_engg creates a table in taxilake from a table in taxidata.

datalake_engg needs CREATE_TABLE permission to create and manage tables in a database.

# lakeadmin in lakecli
lakecli> select * from database_privileges where principal like '%datalake_engg%';
0 rows in set
Time: 0.001s
lakecli> GRANT CREATE_TABLE ON DATABASE 'taxilake' to 'user/datalake_engg';
GRANT
Time: 3.071s
lakecli> select * from database_privileges where principal like '%datalake_engg%';
+----+-------------+--------------------+--------------+-------+
| id | schema_name | principal          | permission   | grant |
+----+-------------+--------------------+--------------+-------+
| 4  | taxilake    | user/datalake_engg | CREATE_TABLE | 0     |
+----+-------------+--------------------+--------------+-------+
1 row in set
Time: 0.027s
lakecli> GRANT SELECT ON TABLE 'taxidata'.'csv_misc' to 'user/datalake_engg';
GRANT
Time: 3.279s
lakecli> select * from table_privileges where principal like '%datalake_engg%';
+----+-------------+------------+--------------------+------------+-------+
| id | schema_name | table_name | principal          | permission | grant |
+----+-------------+------------+--------------------+------------+-------+
| 3  | taxidata    | csv_misc   | user/datalake_engg | SELECT     | 0     |
+----+-------------+------------+--------------------+------------+-------+
1 row in set
Time: 0.024s


#datalake_engg in athenacli
us-east-2:default> select zone from taxidata.csv_misc limit 10;
+---------------------------+
| zone                      |
+---------------------------+
| "Newark Airport"          |
| "Jamaica Bay"             |
| "Allerton/Pelham Gardens" |
| "Alphabet City"           |
| "Arden Heights"           |
| "Arrochar/Fort Wadsworth" |
| "Astoria"                 |
| "Astoria Park"            |
| "Auburndale"              |
| "Baisley Park"            |
+---------------------------+
10 rows in set
Time: 4.064s

us-east-2:default> create table taxilake.zones as select zone from taxidata.csv_misc;
0 rows in set
Time: 11.644s

us-east-2:default> select count(*) from taxilake.zones;
+-----------+
| _col0     |
+-----------+
| 107003212 |
+-----------+
1 row in set
Time: 4.030s
us-east-2:default> select count(*) from taxidata.csv_misc;
+-----------+
| _col0     |
+-----------+
| 107003212 |
+-----------+
1 row in set
Time: 4.844s

The data engineer automatically gets SELECT, INSERT, DELETE, ALTER and DROP permissions with GRANT capability.

\r:iamdb> select * from table_privileges where principal like '%datalake_engg%';
+----+-------------+------------+--------------------+------------+-------+
| id | schema_name | table_name | principal          | permission | grant |
+----+-------------+------------+--------------------+------------+-------+
| 1  | taxilake    | zones      | user/datalake_engg | ALL        | 1     |
| 2  | taxilake    | zones      | user/datalake_engg | ALTER      | 1     |
| 3  | taxilake    | zones      | user/datalake_engg | DELETE     | 1     |
| 4  | taxilake    | zones      | user/datalake_engg | DROP       | 1     |
| 5  | taxilake    | zones      | user/datalake_engg | INSERT     | 1     |
| 6  | taxilake    | zones      | user/datalake_engg | SELECT     | 1     |
| 7  | taxidata    | csv_misc   | user/datalake_engg | SELECT     | 0     |
+----+-------------+------------+--------------------+------------+-------+

Analyst/Data Scientist

datalake_user needs SELECT permission on a table to read data.

# lakeadmin in lakecli
lakecli> select * from table_privileges where principal like '%datalake_user%';
0 rows in set
Time: 0.002s

datalake_user also cannot run any queries in Athena.

# datalake_user in athenacli
us-east-2:default> SELECT COUNT(*) FROM taxidata.csv_misc;
Insufficient permissions to execute the query. Insufficient Lake Formation permission(s) on csv_misc

datalake_user has access to the table data once SELECT permission on the table is granted.

# lakeadmin in lakecli
lakecli> GRANT SELECT ON TABLE 'taxilake'.'zones' to 'user/datalake_user';
GRANT
Time: 1.499s
lakecli> select * from table_privileges where principal like '%datalake_user%';
+----+-------------+------------+--------------------+------------+-------+
| id | schema_name | table_name | principal | permission | grant |
+----+-------------+------------+--------------------+------------+-------+
| 2 | taxilake | csv_misc | user/datalake_user | SELECT | 0 |
+----+-------------+------------+--------------------+------------+-------+
1 row in set
Time: 0.024s

# datalake_user in athenacli
us-east-2:default> SELECT COUNT(*) FROM taxilake.zones;
+-----------+
| _col0     |
+-----------+
| 107003212 |
+-----------+
1 row in set
Time: 6.932s

Summary

AWS Lake Formation Permissions are better suited than IAM permissions to secure a data lake. The two main reasons are

  • Lake Formation Permissions are on logical objects like a database, table or column instead of files and directories.
  • Lake Formation Permissions provide granular control for column-level access.

This blog explores the permissions required for a basic setup of a dataset, administrator, data engineer and an analyst. Get in touch using the form below to discuss more on how to secure your data lake using Lake Formation.