AWS Lake Formations helps to set up a secure data lake on AWS S3. One of the main goals of the product is Simplified Security Management. The central tenet of this goal is to define security, governance, and audit policies in a single location. Lake Formation permissions go beyond IAM policies to manage secure access to databases, tables, and columns.
The previous blog post explained how to set up a data lake with NYC Trip Data.
This tutorial uses the same data lake and shows how to set up basic permissions for two personas: data engineer and data analyst. It explains resources and permissions and sets up two users and their permissions (data engineer and data analyst) to explain the concepts and how they help to secure the data lake.
There are two types of resources in the AWS Lake Formation:
Metadata is stored in AWS Glue Data Catalog and consists of databases, tables, and columns.
Data is stored in AWS S3 locations or databases.
The table below describes the metadata permissions available for each resource in the catalog.
Data Location permissions are applicable to AWS S3 locations registered with Lake Formation. The only permission available as of this article is DATA_LOCATION_ACCESS.
DATA_LOCATION_ACCESS allows users to create tables without the need to have IAM permissions for an S3 location.
Similarly, SELECT, INSERT and DELETE permissions are available only on registered S3 locations.
An important point is that Lake Formation users do not need IAM permissions to access tables in a registered data location. Lake Formation's permissions ensure secure access instead of AWS IAM permissions.
Note that for database locations such as RDS or Redshift, the permissions are determined by the user specified in the JDBC connection string and the GRANT permissions in the database.
AWS Glue, AWS Athena, and AWS Lake Formation have been set up.
The NYC Taxi Trip data set is available in the data lake in the taxidata database.
lakeadmin is the AWS Lake Formation administrator.
lakecli and athenacli are set up and configured in your system.
First, create a few managed policies. Navigate to AWS Console > IAM > Policies > Create Policy and create the following policies
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"lakeformation:GetDataAccess",
"lakeformation:GrantPermissions",
"lakeformation:RevokePermissions",
"lakeformation:BatchGrantPermissions",
"lakeformation:BatchRevokePermissions",
"lakeformation:ListPermissions" ],
"Resource": "*"
}
]
}
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "PassRolePermissions",
"Effect": "Allow",
"Action": [
"iam:PassRole"
],
"Resource": [
"arn:aws:iam::account-id:role/workflow_role"
]
}
]
}
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"lakeformation:GetDataAccess",
"glue:GetTable",
"glue:GetTables",
"glue:SearchTables",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:GetPartitions"
],
"Resource": "*"
}
]
}
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::nyc-tlc",
"arn:aws:s3:::nyc-tlc/*"
]
}
]
}
Navigate to AWS Console > IAM > Create User and create the following users with attached policies.
Create an S3 bucket called prefix_datalake. Choose a unique prefix. In AWS Console, go to AWS Lake Formation > Databases > Create Database Create a database with the following details:
Note that no explicit IAM permissions are set up for the users.
This section sets up the permission for datalake_engg. datalake_engg creates a table in taxilake from a table in taxidata.
datalake_engg needs CREATE_TABLE permission to create and manage tables in a database.
# lakeadmin in lakecli
lakecli> select * from database_privileges where principal like '%datalake_engg%';
0 rows in set
Time: 0.001s
lakecli> GRANT CREATE_TABLE ON DATABASE 'taxilake' to 'user/datalake_engg';
GRANT
Time: 3.071s
lakecli> select * from database_privileges where principal like '%datalake_engg%';
+----+-------------+--------------------+--------------+-------+
| id | schema_name | principal | permission | grant |
+----+-------------+--------------------+--------------+-------+
| 4 | taxilake | user/datalake_engg | CREATE_TABLE | 0 |
+----+-------------+--------------------+--------------+-------+
1 row in set
Time: 0.027s
lakecli> GRANT SELECT ON TABLE 'taxidata'.'csv_misc' to 'user/datalake_engg';
GRANT
Time: 3.279s
lakecli> select * from table_privileges where principal like '%datalake_engg%';
+----+-------------+------------+--------------------+------------+-------+
| id | schema_name | table_name | principal | permission | grant |
+----+-------------+------------+--------------------+------------+-------+
| 3 | taxidata | csv_misc | user/datalake_engg | SELECT | 0 |
+----+-------------+------------+--------------------+------------+-------+
1 row in set
Time: 0.024s
#datalake_engg in athenacli
us-east-2:default> select zone from taxidata.csv_misc limit 10;
+---------------------------+
| zone |
+---------------------------+
| "Newark Airport" |
| "Jamaica Bay" |
| "Allerton/Pelham Gardens" |
| "Alphabet City" |
| "Arden Heights" |
| "Arrochar/Fort Wadsworth" |
| "Astoria" |
| "Astoria Park" |
| "Auburndale" |
| "Baisley Park" |
+---------------------------+
10 rows in set
Time: 4.064s
us-east-2:default> create table taxilake.zones as select zone from taxidata.csv_misc;
0 rows in set
Time: 11.644s
us-east-2:default> select count(*) from taxilake.zones;
+-----------+
| _col0 |
+-----------+
| 107003212 |
+-----------+
1 row in set
Time: 4.030s
us-east-2:default> select count(*) from taxidata.csv_misc;
+-----------+
| _col0 |
+-----------+
| 107003212 |
+-----------+
1 row in set
Time: 4.844s
The data engineer automatically gets SELECT, INSERT, DELETE, ALTER and DROP permissions with GRANT capability.
\r:iamdb> select * from table_privileges where principal like '%datalake_engg%';
+----+-------------+------------+--------------------+------------+-------+
| id | schema_name | table_name | principal | permission | grant |
+----+-------------+------------+--------------------+------------+-------+
| 1 | taxilake | zones | user/datalake_engg | ALL | 1 |
| 2 | taxilake | zones | user/datalake_engg | ALTER | 1 |
| 3 | taxilake | zones | user/datalake_engg | DELETE | 1 |
| 4 | taxilake | zones | user/datalake_engg | DROP | 1 |
| 5 | taxilake | zones | user/datalake_engg | INSERT | 1 |
| 6 | taxilake | zones | user/datalake_engg | SELECT | 1 |
| 7 | taxidata | csv_misc | user/datalake_engg | SELECT | 0 |
+----+-------------+------------+--------------------+------------+-------+
datalake_user needs SELECT permission on a table to read data.
# lakeadmin in lakecli
lakecli> select * from table_privileges where principal like '%datalake_user%';
0 rows in set
Time: 0.002s
datalake_user also cannot run any queries in Athena.
# datalake_user in athenacli
us-east-2:default> SELECT COUNT(*) FROM taxidata.csv_misc;
Insufficient permissions to execute the query. Insufficient Lake Formation permission(s) on csv_misc
datalake_user has access to the table data once SELECT permission on the table is granted.
# lakeadmin in lakecli
lakecli> GRANT SELECT ON TABLE 'taxilake'.'zones' to 'user/datalake_user';
GRANT
Time: 1.499s
lakecli> select * from table_privileges where principal like '%datalake_user%';
+----+-------------+------------+--------------------+------------+-------+
| id | schema_name | table_name | principal | permission | grant |
+----+-------------+------------+--------------------+------------+-------+
| 2 | taxilake | csv_misc | user/datalake_user | SELECT | 0 |
+----+-------------+------------+--------------------+------------+-------+
1 row in set
Time: 0.024s
# datalake_user in athenacli
us-east-2:default> SELECT COUNT(*) FROM taxilake.zones;
+-----------+
| _col0 |
+-----------+
| 107003212 |
+-----------+
1 row in set
Time: 6.932s
AWS Lake Formation Permissions are better suited than IAM permissions to secure a data lake. The two main reasons are Lake Formation Permissions are on logical objects like a database, table, or column instead of files and directories. Lake Formation Permissions provide granular control for column-level access. This blog explores the permissions required for a basic setup of a dataset, administrator, data engineer, and analyst. Get in touch with us to discuss further how to secure your data lake using Lake Formation.
Get started on building a data governance platform for your organisation with our guest writer, Syed Atif Akhtar.
Learn which data system you should use, based on who uses the metadata.
The first step is to understand what data governance is. Data Governance is an overloaded term and means different things to different people. It has been helpful to define Data Governance based on the outcomes it is supposed to deliver.
Get in touch for bespoke support for PII Catcher
We can help discover, manage and secure sensitive data in your data warehouse.