lakeFS, Source Controlled Data

Source: https://github.com/databricks/tech-talks

The title of this blog gives away the whole essence of the story you are about to read.

We are producing data at an extreme pace, and at the same time we have built analytics use cases that alter/transform that data into insights, which ultimately drive business decisions.

When your jobs run 24/7, 365 days a year, it would be ridiculous to claim that your pipeline can never fail.

A job failure is something you can never fully predict or control.

It could be as small an issue as somebody changing a config in the backend, or as big as an entire data center going down.

What we can prepare, though, is a recovery plan for when something does fail.

This is where lakeFS comes into the picture.

What is lakeFS

Transform your object storage into a Git-like repository

lakeFS enables you to manage your data lake the way you manage your code. Run parallel pipelines for experimentation and CI/CD for your data.

(Source: lakeFS)

Why do you need lakeFS

lakeFS brings source control to your big data.

We are heavily dependent on cloud object storage, be it S3, ADLS or Google Cloud Storage.

That storage is about as resilient as it gets. But lakeFS mainly fills the gaps that current cloud storage cannot: versioning, isolation and rollback.

Features

  1. lakeFS provides version control on top of your storage.
  2. You can work on the same data in isolation, in your own branch, without impacting any other version of it. Once you are ready, merge it into your main branch.
  3. lakeFS scales this version control to exabytes of data, and you can create as many branches as you want.
  4. lakeFS supports all the Git-like operations: branch, commit, merge, revert.
  5. Pre-commit/merge hooks enable CI/CD for your data.
  6. When you realise there was an error, you can instantly revert the changes to your data.

General requirements and how lakeFS fulfils them

Isolation

We always prefer to work in isolation and not disturb other people's work.

lakectl branch create lakefs://my-repo/my-branch -s lakefs://my-repo/main

Revert the changes

Upon realising that there was an error, you want to revert back to the previous state.

lakectl branch revert lakefs://my-repo/main <commit-id>

Error reproducing

We can reproduce an error by querying the data exactly as it was at the commit that failed.

spark.read.parquet("s3a://myrepo/<commit-id>/path/")

Update data sets

Once you have confirmed that the data is correct, you can merge it.

lakectl merge lakefs://myrepo/test lakefs://myrepo/main

Tech stack Integration

lakeFS integrates with the current modern big data stack, including

Kafka, Spark, Databricks, Hudi, Presto, S3, ADLS and many more.

File format integration

lakeFS supports the common file and table formats, such as Parquet and Delta Lake, among others.

Object store Integration

lakeFS integrates with any tool that works against an object store.

Data Model

lakeFS uses the Graveler data model, which scales to exabytes of data.

https://docs.lakefs.io/understand/data-model.html

A Simple DEMO

curl https://compose.lakefs.io | docker-compose -f - up

Hit the URL: http://127.0.0.1:8000/setup

  1. Create your first user.
  2. Create your first repo.
  3. Check the recent commits.
  4. Make your first upload.
  5. Create a branch.
  6. Upload a file to the branch.
  7. Commit the changes.
  8. Merge the branch into main.
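The UI steps above can also be sketched with the lakectl CLI. A minimal sketch, assuming hypothetical names (my-repo, my-branch, an underlying bucket s3://my-bucket, a local file data.csv) and that lakectl is already configured against your local lakeFS instance:

```shell
# Create your first repository, backed by a (hypothetical) bucket.
lakectl repo create lakefs://my-repo s3://my-bucket

# Create a branch off main and upload a file to it.
lakectl branch create lakefs://my-repo/my-branch -s lakefs://my-repo/main
lakectl fs upload -s ./data.csv lakefs://my-repo/my-branch/data.csv

# Commit the changes, then merge the branch into main.
lakectl commit lakefs://my-repo/my-branch -m "add data.csv"
lakectl merge lakefs://my-repo/my-branch lakefs://my-repo/main
```

Every object path is prefixed by the branch name, which is what gives each branch its isolated view of the data.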

Real-world use cases work similarly to the demo above.

You work in complete isolation on your own branch, with logical segregation of the data.

Say you are using Spark.

You need to set the parameters below, which point the S3A client at your lakeFS endpoint.

spark.hadoop.fs.s3a.bucket.<repo-name>.access.key XXXXXXXXXXXXXX
spark.hadoop.fs.s3a.bucket.<repo-name>.secret.key XXXXXXXXXXXX
spark.hadoop.fs.s3a.bucket.<repo-name>.endpoint https://lakefs.example.com
spark.hadoop.fs.s3a.path.style.access true

When you want to read the data, you will pass the REPO and the branch you are working in your read path.

val repo = "example-repo"
val branch = "main"
val dataPath = s"s3a://${repo}/${branch}/example-path/example-file.parquet"
val df = spark.read.parquet(dataPath)

Similarly, you need to pass the repo and branch name in the path when writing the data as well.

df.write
  .partitionBy("example-column")
  .parquet(s"s3a://${repo}/${branch}/output-path/")
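Writing from Spark only stages the new objects on the branch; they still need to be committed, and eventually merged. A hedged sketch with lakectl, reusing the example-repo name from the Spark snippet and a hypothetical feature branch (my-branch) instead of writing straight to main:

```shell
# Commit the Spark output staged on the branch (hypothetical branch name).
lakectl commit lakefs://example-repo/my-branch -m "nightly Spark output"

# Merge the branch into main once the data is validated.
lakectl merge lakefs://example-repo/my-branch lakefs://example-repo/main
```

If validation fails instead, you simply never merge, and main stays untouched.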


Ajith Shetty

Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.


Interested in a weekly newsletter on big data analytics around the world? Subscribe to my weekly newsletter, Just Enough Data.
