Version Control for Data Lake with LakeFS
Keeping track of changes and revisions to your data is more important than ever in today's fast-paced world of AI, data science, and analytics. Version control is the key tool for managing these changes: it makes it simple to track your data, collaborate on it, and roll back to earlier versions.
Whether you work alone or in a team, version control helps ensure that your data stays correct, consistent, and dependable.
In this blog post, we'll talk about lakeFS, an open-source version control tool for Data Lakes, and look at how it can enhance your productivity and workflow.
Architecture
At its core, lakeFS uses a combination of object storage and a metadata layer, allowing it to provide powerful version control and collaboration features while being highly scalable and cost-effective.
lakeFS supports AWS S3, Azure Blob Storage, Google Cloud Storage, and MinIO as storage backends, and DynamoDB and PostgreSQL as metadata stores.
lakeFS acts as an intermediate layer between query engines and the storage backend, which makes it possible to add versioning to an existing Data Lake without major changes to its architecture.
lakeFS also supports a Hadoop FileSystem configuration, which means systems like Spark and Databricks can access lakeFS repositories directly with minimal configuration.
An example of reading a file from a lakeFS repository inside Spark:
// Read a Parquet file from the main branch of the example-repo repository
val repo = "example-repo"
val branch = "main"
val dataPath = s"lakefs://${repo}/${branch}/example-path/example-file.parquet"
val df = spark.read.parquet(dataPath)
Image Credits (https://docs.lakefs.io/understand/architecture.html)
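To make lakefs:// paths like the one above resolvable from Spark, the lakeFS Hadoop FileSystem needs to be configured. Below is a minimal sketch of what launching spark-shell with that configuration might look like; the package coordinates, configuration keys, and placeholder values are based on the lakeFS documentation and should be verified against the lakeFS and Spark versions you use.
# Sketch: spark-shell with the lakeFS Hadoop FileSystem configured.
# Replace <version>, the keys, and the endpoint with values for your setup.
spark-shell --packages io.lakefs:hadoop-lakefs-assembly:<version> \
  --conf spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem \
  --conf spark.hadoop.fs.lakefs.access.key=<lakefs_access_key> \
  --conf spark.hadoop.fs.lakefs.secret.key=<lakefs_secret_key> \
  --conf spark.hadoop.fs.lakefs.endpoint=http://<lakefs_host>:8000/api/v1 \
  --conf spark.hadoop.fs.s3a.access.key=<aws_access_key> \
  --conf spark.hadoop.fs.s3a.secret.key=<aws_secret_key>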
1. Setting up AWS Resources
This tutorial uses AWS S3 as the storage layer and DynamoDB as the metadata store for lakeFS.
We will use the Terraform CLI to create these AWS resources. Instructions for installing the CLI can be found here.
The first step is to create a variables.tf file that holds information such as the S3 bucket name, environment, and AWS region.
variable "s3_bucket_name" {
type = string
default = "<replace_with_bucket_name>"
}
variable "env" {
type = string
default = "development"
}
variable "zones" {
type = map
default = {
"north_virginia" = "us-east-1"
}
}
The resources we need to create are:
- S3 Bucket (Private)
- IAM User for lakeFS
- Bucket Policy for lakeFS to access S3 Bucket
- IAM policy to allow lakeFS to create a DynamoDB table (name: kvstore)
- Access Key & Secret Key for lakeFS IAM account
Now let’s create the main.tf file, which will hold the definitions for all the AWS resources.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.16"
    }
  }

  required_version = ">= 1.2.0"
}

provider "aws" {
  region = var.zones.north_virginia
}
resource "aws_s3_bucket" "lakefs_bucket" {
bucket = var.s3_bucket_name
tags = {
name = var.s3_bucket_name
env = var.env
}
}
resource "aws_s3_bucket_acl" "lakefs_bucket" {
bucket = aws_s3_bucket.lakefs_bucket.id
acl = "private"
}
resource "aws_iam_user" "lakefs_user" {
name = "lakefs_user"
path = "/system/"
tags = {
env = var.env
}
}
resource "aws_iam_access_key" "lakefs_user" {
user = aws_iam_user.lakefs_user.name
}
resource "aws_s3_bucket_policy" "lakefs_s3_access" {
bucket = aws_s3_bucket.lakefs_bucket.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "lakeFSObjects",
Action = [
"s3:GetObject",
"s3:PutObject",
"s3:AbortMultipartUpload",
"s3:ListMultipartUploadParts"
],
Effect = "Allow",
Principal = {
AWS = ["${aws_iam_user.lakefs_user.arn}"]
},
Resource = ["${aws_s3_bucket.lakefs_bucket.arn}/*"]
},
{
Sid = "lakeFSBucket",
Action = [
"s3:ListBucket",
"s3:GetBucketLocation",
"s3:ListBucketMultipartUploads"
],
Effect = "Allow",
Principal = {
AWS = ["${aws_iam_user.lakefs_user.arn}"]
},
Resource = ["${aws_s3_bucket.lakefs_bucket.arn}"]
}
]
})
}
resource "aws_iam_user_policy" "lakefs_user_ro" {
name = "lakefs_user_policy"
user = aws_iam_user.lakefs_user.name
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "ListAndDescribe",
Effect = "Allow",
Action = [
"dynamodb:List*",
"dynamodb:DescribeReservedCapacity*",
"dynamodb:DescribeLimits",
"dynamodb:DescribeTimeToLive"
],
Resource = "*"
},
{
Sid = "kvstore",
Effect = "Allow",
Action = [
"dynamodb:BatchGet*",
"dynamodb:DescribeTable",
"dynamodb:Get*",
"dynamodb:Query",
"dynamodb:Scan",
"dynamodb:BatchWrite*",
"dynamodb:CreateTable",
"dynamodb:Delete*",
"dynamodb:Update*",
"dynamodb:PutItem"
],
Resource = "arn:aws:dynamodb:*:*:table/kvstore"
}
]
})
}
output "access_key" {
value = aws_iam_access_key.lakefs_user.id
}
output "secret_key" {
value = nonsensitive(aws_iam_access_key.lakefs_user.secret)
}
Note: the nonsensitive() function used in the secret_key output exposes the secret in Terraform’s plain-text output and is not recommended for production scenarios. It is used here only for development convenience.
Once variables.tf and main.tf are created, use this command to initialize the Terraform working directory.
terraform init
Once Terraform is initialized, use this command to preview the resources that Terraform will be creating on your AWS account.
terraform plan
If everything looks good, use this command to go ahead and create the resources.
terraform apply
Terraform will ask for confirmation before deploying the resources; type “yes” to approve. Once all the resources are created, you can expect an output as shown below.
The access_key and secret_key values are required in the next steps while setting up lakeFS.
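If you need these values again later, Terraform can reprint them from its saved state:
# Print the outputs recorded in the Terraform state
terraform output access_key
terraform output secret_key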
2. Setting up lakeFS
Once all the AWS resources are created, we can set up lakeFS using Docker and Docker Compose.
First, create a .env file with the values below. The <access_key> and <secret_key> placeholders need to be replaced with the output of the Terraform deployment from the previous step.
LAKEFS_BLOCKSTORE_TYPE=s3
LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true
LAKEFS_BLOCKSTORE_S3_REGION=us-east-1
LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID=<access_key>
LAKEFS_BLOCKSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY=<secret_key>
LAKEFS_AUTH_ENCRYPT_SECRET_KEY=secret
LAKEFS_DATABASE_TYPE=dynamodb
LAKEFS_DATABASE_DYNAMODB_AWS_REGION=us-east-1
LAKEFS_DATABASE_DYNAMODB_AWS_ACCESS_KEY_ID=<access_key>
LAKEFS_DATABASE_DYNAMODB_AWS_SECRET_ACCESS_KEY=<secret_key>
LAKEFS_STATS_ENABLED=false
LAKEFS_LOGGING_LEVEL=DEBUG
Create a docker-compose.yml with the below config.
version: '3.7'

services:
  lakefs:
    image: treeverse/lakefs:latest
    container_name: lakefs
    ports:
      - 8000:8000
    env_file:
      - .env
Use the below command to start the lakeFS container.
docker-compose up
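If you prefer to keep the container running in the background, you can start it detached and follow the logs separately:
# Start lakeFS in the background and tail its logs
docker-compose up -d
docker-compose logs -f lakefs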
The lakeFS container takes a while to start: it first creates the metadata store (the kvstore table) in DynamoDB, and once that is done you will see output on the console as shown below.
At this stage, lakeFS is ready for the initial setup and can be accessed at this URL: http://127.0.0.1:8000
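You can also check that the server is responding from the command line. The health-check path below is an assumption based on the lakeFS OpenAPI spec; verify it against the docs for your lakeFS version.
# Assumed health-check endpoint; adjust the path if your lakeFS version differs
curl -i http://127.0.0.1:8000/api/v1/healthcheck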
On the initial screen, provide an email address (for communication purposes) and a username for the admin account.
Once the username is provided, lakeFS creates the admin account in the metadata store (DynamoDB) and returns an Access Key and Secret Key. It also offers an option to download the configuration file for the lakectl CLI, which we will use in the subsequent steps.
With this step, you are all set to create your first repo on the S3 Data Lake using lakeFS.
3. Create a Repository on lakeFS
Now that lakeFS is ready, let’s create a repository and add some files to the data lake.
Create a new repository by providing a repository name, an S3 storage namespace, and a default branch. For the storage namespace, use this format: s3://<bucket_name>/<path> or s3://<bucket_name>/
lakeFS creates a repository at the given S3 location, and we are ready to add files to the repo.
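For reference, the same repository can also be created from the CLI once lakectl is configured (see the next step). A sketch, assuming a repository named lakefs-repo and the bucket created earlier; check the flags against your lakectl version:
# Create the repository from the CLI (equivalent to the UI flow above)
lakectl repo create lakefs://lakefs-repo s3://<bucket_name>/ -d main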
4. Setting up lakeFS CLI
Files can be uploaded to the repo directly from the lakeFS UI or using the lakectl CLI.
For large data ingestion, it is recommended to use lakectl. The setup instructions for the CLI can be found here.
Once the CLI is installed, run the command below and enter the Access Key and Secret Key generated for the lakeFS admin account in the previous step. The server endpoint should point to http://127.0.0.1:8000
lakectl config
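Once the configuration is saved, a quick way to verify that lakectl can reach the server is to list the repositories; you should see the repository created in the previous step.
# Verify connectivity by listing the repositories on the lakeFS server
lakectl repo list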
5. Add files to lakeFS Repo
We will be using the NYC Taxi Trip Duration dataset, which has train.csv and test.csv files, for demonstration purposes. The dataset can be found here: https://www.kaggle.com/competitions/nyc-taxi-trip-duration/data
The dataset can be uploaded to the same S3 bucket we created with Terraform, under a separate location: s3://<bucket_name>/datasets/.
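One way to upload the files is with the AWS CLI, assuming it is installed and configured with credentials that can write to the bucket:
# Upload the dataset files to the S3 bucket
aws s3 cp train.csv s3://<bucket_name>/datasets/
aws s3 cp test.csv s3://<bucket_name>/datasets/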
Once the dataset files are uploaded to the above path at S3, use this command to copy them to the lakeFS repo.
lakectl ingest \
--from s3://<bucket_name>/datasets/ \
--to lakefs://lakefs-repo/main/datasets/
The --from source URL also supports ingesting data from external buckets, which can be public or private. For private buckets, the IAM user for lakeFS needs to have sufficient permissions.
This is the anatomy of a lakeFS URI, which will be helpful for all interactions with lakeFS:
lakefs://<repo_name>/<branch>/<your_desired_path>/
Once the ingestion is successful, you can refresh the lakeFS UI to validate the uploaded files.
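The same check can be done from the CLI by listing the objects under the ingested path:
# List the objects that were ingested into the main branch
lakectl fs ls lakefs://lakefs-repo/main/datasets/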
6. Versioning with lakeFS
Now that files are added to the repo, we need to create a commit for the addition of files.
lakectl commit lakefs://lakefs-repo/main --message "Initial Magic"
Once a commit is created, it can be verified on lakeFS UI.
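The commit history can also be inspected from the CLI:
# Show the commit log of the main branch
lakectl log lakefs://lakefs-repo/main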
New branches for dev and staging can be created out of the main branch using the below commands.
lakectl branch create lakefs://lakefs-repo/dev --source lakefs://lakefs-repo/main
lakectl branch create lakefs://lakefs-repo/staging --source lakefs://lakefs-repo/dev
New changes/files can be uploaded to the dev and staging branches and a commit can be created as we did earlier for the main branch.
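For example, a single local file can be uploaded to the dev branch and committed as follows. The file name here is only an illustration, and the --source flag of lakectl fs upload should be verified against your lakectl version.
# Upload a local file to the dev branch and commit the change
lakectl fs upload lakefs://lakefs-repo/dev/datasets/new-file.csv --source ./new-file.csv
lakectl commit lakefs://lakefs-repo/dev --message "Add new-file.csv"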
Branches can be merged into upstream branches, similar to git, using the below commands.
lakectl merge lakefs://lakefs-repo/staging lakefs://lakefs-repo/dev
lakectl merge lakefs://lakefs-repo/dev lakefs://lakefs-repo/main
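At any point, the differences between two branches can be reviewed from the CLI, which is especially useful before a merge:
# Show the differences between the main and dev branches
lakectl diff lakefs://lakefs-repo/main lakefs://lakefs-repo/dev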
Documentation for the various other lakectl commands can be found here.
7. Conclusion
In this tutorial, we set up a basic lakeFS configuration on AWS using S3 and DynamoDB.
In subsequent posts, we will explore more advanced lakeFS configurations as well as integration with query engines like Spark and Trino.
Code snippets used in this tutorial are available in the repo: https://github.com/saivarunk/lakefs-recipes/tree/main/aws-initial-setup