lakeFS, Source Controlled Data
The heading of this blog gives away the whole essence of the story you are about to read.
We are producing data at an extreme pace, and at the same time we have built analytics use cases to alter and transform that data into insights, which we ultimately use to make business decisions.
When your jobs run 24*7, 365 days a year, it would be ridiculous to claim that your pipelines can never fail.
A job failure is something you can never fully predict or control.
It could be something as small as somebody changing a config in the backend, or as big as the whole data center going down.
But what we can be prepared with is a recovery plan for when something fails.
This is where lakeFS comes into the picture.
What is lakeFS
Transform your object storage into a Git-like repository
lakeFS enables you to manage your data lake the way you manage your code. Run parallel pipelines for experimentation and CI/CD for your data.
Why do you need lakeFS
lakeFS introduces source control to your big data.
We are heavily dependent on cloud storage, whether it is S3, ADLS or Google Cloud Storage.
Cloud storage is about as resilient as it gets, but lakeFS is mainly trying to fill the gaps that we cannot cover with cloud storage alone.
Features
- lakeFS provides version control on top of your existing storage.
- You can work on the same data in isolation in your own branch without impacting any other version of it. Once you are ready, merge it into your main branch.
- lakeFS scales version control to exabytes of data, and you can create as many branches as you want.
- lakeFS supports all the Git-like operations such as branch, commit, merge and revert.
- Pre-commit/merge hooks for data CI/CD.
- When you realise that there was an error, you can instantly revert the changes to your data.
General requirements and how lakeFS fulfils them
Isolation
We always prefer to work in isolation and not disturb other people's work.
lakectl branch create lakefs://example-repo/my-branch --source lakefs://example-repo/main
Revert the changes
Upon realising that there was an error, you want to revert back to the previous state.
lakectl branch revert lakefs://example-repo/main <commit-id>
Reproducing errors
We can reproduce errors by querying the data as of the particular commit or branch that failed.
spark.read.parquet("s3a://example-repo/<commit-id>/example-path/")
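For example, here is a minimal sketch of how that comparison might look in Spark; the repo name, path and commit ID below are placeholders, and the commit ID can be taken from the lakeFS UI or from lakectl log.
// Placeholders: replace the repo, path and commit ID with your own values
val failing = spark.read.parquet("s3a://example-repo/<commit-id>/example-path/")
val current = spark.read.parquet("s3a://example-repo/main/example-path/")
// Compare the two versions, for example by row count, to pin down what went wrong
println(s"failing commit: ${failing.count()}, main: ${current.count()}")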
Update data sets
Once you confirm that the data is correct, you can merge it back into main.
lakectl merge lakefs://example-repo/my-branch lakefs://example-repo/main
Tech stack Integration
lakeFS integrates with the modern big data stack, including Kafka, Spark, Databricks, Hudi, Presto, S3, ADLS and many more.
File format integration
lakeFS works with the file formats you already use, such as Parquet, Delta Lake and many more.
Object store Integration
lakeFS integrates with any tool that works against an object store.
Data Model
lakeFS uses the Graveler data model, which scales to exabytes of data.
https://docs.lakefs.io/understand/data-model.html
A Simple Demo
curl https://compose.lakefs.io | docker-compose -f - up
Hit the URL: http://127.0.0.1:8000/setup
- Create your first user.
- Create your first repo.
- Check the recent commits.
- Make your first upload.
- Create a branch.
- Upload a file to the branch.
- Commit the changes.
- Merge the branch into main.
Real-world use cases work just like the demo above: you work completely in isolation, on your own branch, with a logical segregation of the data.
Consider that you are using Spark. You need to set the parameters below.
spark.hadoop.fs.s3a.bucket.<repo-name>.access.key XXXXXXXXXXXXXX
spark.hadoop.fs.s3a.bucket.<repo-name>.secret.key XXXXXXXXXXXX
spark.hadoop.fs.s3a.bucket.<repo-name>.endpoint https://lakefs.example.com
spark.hadoop.fs.s3a.path.style.access true
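The same per-bucket settings can also be applied in code on the SparkSession builder. Here is a minimal sketch; the repo name (example-repo), keys and endpoint are placeholders for your own lakeFS installation.
import org.apache.spark.sql.SparkSession

// All values below are placeholders: use the repo name, keys and endpoint of your own lakeFS setup
val spark = SparkSession.builder()
  .appName("lakefs-example")
  .config("spark.hadoop.fs.s3a.bucket.example-repo.access.key", "XXXXXXXXXXXXXX")
  .config("spark.hadoop.fs.s3a.bucket.example-repo.secret.key", "XXXXXXXXXXXX")
  .config("spark.hadoop.fs.s3a.bucket.example-repo.endpoint", "https://lakefs.example.com")
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .getOrCreate()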
When you want to read the data, you pass the repo and the branch you are working on as part of the read path.
val repo = "example-repo"
val branch = "main"
val dataPath = s"s3a://${repo}/${branch}/example-path/example-file.parquet"
val df = spark.read.parquet(dataPath)
Similarly, you need to pass the repo and the branch name in the path when writing the data as well.
df.write
  .partitionBy("example-column")
  .parquet(s"s3a://${repo}/${branch}/output-path/")
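Putting the pieces together, here is a rough sketch of the isolated-experiment workflow described earlier: read the production data from main, transform it, and write the result to your own branch so that main stays untouched until you merge. The repo, branch, paths and column names are placeholders.
import org.apache.spark.sql.functions.col

// Placeholders: swap in your own repo, branch, paths and columns
val repo = "example-repo"
val experimentBranch = "my-branch"

// Read the production data from the main branch
val source = spark.read.parquet(s"s3a://${repo}/main/example-path/example-file.parquet")

// Apply the transformation in isolation
val transformed = source.filter(col("example-column").isNotNull)

// Write the output to the experiment branch; main is not affected until the branch is merged
transformed.write
  .partitionBy("example-column")
  .parquet(s"s3a://${repo}/${experimentBranch}/output-path/")

Once the results look good, merge the branch back into main (lakectl merge, as shown earlier) and the new data becomes visible to everyone reading from main.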
Reference:
Ajith Shetty
Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.
Interested in a weekly newsletter on big data analytics around the world? Do subscribe to my Weekly Newsletter: Just Enough Data.