Apache Hudi pronounced “hoodie”
Data has become as expensive as oil or the gold. now more and more companies are spending millions of dollars on Data and taking Data driven decisions.
Keeping the freshness of the data and keeping it up-to date has become more and more important.
And at the same time the regulations as well has become stricter. Like GDPR or CCPA.
Having to store Terabytes of data daily we need to be able to adhere to the policies under GDPR where we have to have a mechanism to go delete a record for the given users.
GDPR is only 1 example out of 100 others.
The requirement of the next gen companies has changed and below are the high level requirements.
- Easy integration with realtime ingestion and easy storage
- Support to realtime queries.
- Data consistency
- Time travel
- Efficient updates
- incremental processing
- Data validation accuracy
- Data duplication
- small file problem
To answer the above problem the Apache HUDI has been introduced.
Hudi is a rich platform to build streaming data lakes with incremental data pipelines
on a self-managing database layer, while being optimized for lake engines and regular batch processing.
Apache Hudi is a is an open-source framework which is used to simplify the incremental data processing requirements.
Hudi will help you in Change Data Capture to maintain the delta at a regular interval or the realtime.
Hudi will manage the underlying data stored on S3 using Apache Parquet and Apache Avro.
With Hudi, late arriving data can be “upserted” into an existing data set. When changes are made, Hudi will find the appropriate files in S3 and rewrite them to incorporate the changes.
Hudi also allows you to view your data set at specific points in time.
In Technical terms
- Manages the versioning of the new updated data and time travel is possible between these versions.
- Supports indexing which will define where the given record is located.
- Compaction of the data at regular interval to solve the small file prolems.
- Atomic operations with rollback features.
Table Type defines as in how the data will be indexed and stored in the data storage.
Query type defines how the stored data will be exposed to the queires.
Copy on Write
When we have the simple job which has fairly simple updates at a regular intervals.
Query type Supported: read optimised, and incremental
In general when we get a write operation Hudi will analyse the number of files to be written.
Once the write is complete it will commit it to the commit timeline.
Merge on read
Merge on Read is used mainly for near real time ingestion.
To support realtime queries with less wait time.
When the data freshness is not the concern.
Query type: Supports Read optimised, incremental, real time
Merge on read will store the newly added data parallel to the older data but will not merge them together.
Delta data is stored as Write ahead logs, when the file reaches the given size then Compaction will kick in to merge them together and that is the time when the data will be visible to Read Optimised queries. Until then the read optimised query will always be reading the older version or the previous merged version.
Realtime queries will merge the delta and the initial files and sends back to the consumer.
Read optimised queries will give you old data if the data freshness is not a concern( will only read the dat in the file not the delta or write ahead)
Copy on Write Vs Merge On Read
Copy on Write will always rewrite the entire parquet as the merge will occur for every Delta write. where as Merge On read will merge only when the write ahead log or delta file reaches to specific file size.
Copy on write will end up creating multiple small files. where as merge on read will create larger files as it will merge at intervals.
Indexing in Hudi
Indexing in HUDI would help you when you need to find the given record for an update.
Instead of reading the entire data(given partition) we will read records in a given file where the record is present.
When the insert request comes in, the HUDI will intelligently decide the indexing and stores the data accordingly. This will be used during the read operation as to exlcude the files where the records are not present.
Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting a datatype of a field, out-of-the-box. The evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
Hudi Write APIS
- Bulk insert
- Hive integration
- Snapshot: Latest time or given time using time travel.
- Incremental: Delta
In the Part 2 of this blog we shall deep dive into the implementation and code examples.
Hello from Apache Hudi | Apache Hudi
Hudi is a rich platform to build streaming data lakes with incremental data pipelines on a self-managing database…
Overview | Apache Hudi
Welcome to Apache Hudi! This overview will provide a high level summary of what Apache Hudi is and will orient you on…
Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.