We have always been challenged with the data maintenance and how efficient it could be to provide the one pit stop solution to your analytics.
When it comes to achieving the best storage performance we always tend to think the file format, which supports serialisation or compression.
Then comes the infrastructure to support the analytical uses cases.
But we hardly talk about a thin layer between your file formats and the compute.
Your compute like Spark, Hive or Flink will directly be talking to the file formats.
And then apply all type of filters like partitioning, predicate pushdown etc.
But what if we could apply the rule before you touch the file formats and disregard the files you are not interested in reading based on your query?
Enter Apache Iceberg a Table format.
Where does the Table format fits
Apache Iceberg is a layer on top of File format like parquet or ORC and the compute will make use of this table format on top of the file formats which is stored in the storage.
Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to compute engines including Spark, Trino, PrestoDB, Flink and Hive using a high-performance table format that works just like a SQL table.
Apache Iceberg was invented by Netflix and later open sourced to Apache Foundation.
It became the ASF high level project at May 2020.
And the main users of Iceberg are Netflix and Apple.
Goals of Iceberg
Iceberg has tried to answer a lot of pain points we have in the current generation of data.
- Support to in place file formats. How can we get the best-est of performance out of the data which I already have and without changing the pipeline or the file format.
- Schema evolution. As the time changes, the schema will change. How can we be able to support the schema evolution over time without braking the downstream applications.
- Atomic operations in BIGDATA. It’s a big ask in the big data world as to achieve the ACID compliance.
- To get rid of directory and file listing. Every query requires a file listing and discarding based on the footer(parquet).
- Altering partitioning column over the time and should support previous and new partitioning columns.
- Isolation between read and write.
- Implicit partitioning and I expect the engine to take care of my partitioning columns.
- Versioning and time travel.
And many more.
It’s one hell of requirements. and frankly Iceberg did answer all of them.
1. Support to in place file formats
As we have already discussed Iceberg is not a file format but it’s a table format.
Having to use the Iceberg, doesn’t require you to change the file format if you are using ORC, Avro or Parquet.
It integrates well with the current file format and it adds a layer on top of these file formats.
Plus the Iceberg is built in a way where it can run on top of the current execution engines like Hive, Spark or Presto.
2. Schema evolutions
Iceberg does supports ADD, DROP, RENAME, UPDATE or REORDER of a column.
The schema evolution or changes would only effect the metadata. It means you are not required to change or rewrite the whole data again.
Note that map keys do not support adding or dropping struct fields that would change equality.
3. Atomic Operations
All the writes will always work at isolation without impacting the current read or the current schema in the metadata.
Readers will never be able to read the partially committed data.
Once the operation is fully succeeded the atomic operation will be committed and will be visible to the readers.
4. To get rid of directory and file listing
The traditional file format requires you to read the list of directories in case of partitioning and list all the files within the partition and based on the footer exclude the files which are not in the filter clause.
Though we are not in need of most of the partitions and fields within those partitions we end up listing the each and every-time for every query.
It’s a lot of listing.
Iceberg doesn't require a listing of the files. It maintains the data in the manifests.
We shall talk about it more in the architecture.
5. Altering the partitioning columns
The architecture we have defines years back will not be able to support all my future use cases.
Similarly in case of partitioning the data, the data distribution based on the column we have defined would need a change as the time progresses.
For eg. we may want to change the partitioning column from Month to date without having to rewrite the whole data.
Iceberg does support the different level of partitioning at the same schema.
6. Isolation between read and write
Iceberg maintains the snapshots of the files which changed as time progresses. This will support the READ and WRITE to occur parallel but in isolation.
Unless the new write is committed, this snapshot will not be available for read.
This will make sure your data and schema will not break or read the partial data.
7. Implicit partitioning
Defining the partitioning shouldn't be mandatory. As m requirement will change with the time. Having to know the partitioning at the run time will improve your query performance exponentially.
Iceberg just does that. Implicitly the partitioning column will be fetched and read.
8. Time travel
As e have already discussed the Iceberg supports the versioning. This makes the time travel possible if you are required to go back to a previous version as per your requirement.
Iceberg stores the medata parallel to the data files.
Snapshot metadata file: This stores the table specifications like table schema, the partitioning columns and the path to the manifest list.
Manifest list: 1 manifest can have multiple manifests. Each manifest file pointing to a particular version of the file. This contains the data file count, the partitioning columns, the min max of the data etc.
Data file: is basically your data in your given file format like parquet, orc or Avro.
Let’s create a dummy table and write as iceberg and list the folder path.
We see that there are 2 folders under our home table path.
Data folder contains the actual files in parquet.
Metadata folder contains the manifests.
Avro file contains the 4 records. Each record represents a file in the data directory.
With that it contains the path, file format, the record count , the block size, the partition column and many more information.
metadata snap fie contains the path the manifest and the version details.
Since there was only 1 write we have a 1 snapshot record.
And finally metdata contains the metdata of each columns with the path of the file and the version details.
When we wrote the same data again to the same iceberg path we can see that it made a new version of the file where it created a new version json file.
And it has added a new avro file with for the new files written.
Some tips while you use the Iceberg
Increasing number of snapshots: Over the time you tend to keep adding new snapshots of a file which will keep growing so even the metdata files.
It is advised to keep the few pf the versions and make use of expire snapshots based on your use cases.
Orphan data metdata files: These are uncommitted files which might have failed while writing.
You may use an API RemoveOrphanFilesAction to DELETE them.
About - Apache Iceberg
Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to compute engines including…
GitHub - apache/iceberg: Apache Iceberg
Apache Iceberg is a new table format for storing large, slow-moving tabular data. It is designed to improve on the…
You may find the example code in my github link
GitHub - ajithshetty/IcebergDemo
Contribute to ajithshetty/IcebergDemo development by creating an account on GitHub.
Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.
Interested in getting the weekly updates on the big data analytics around the world, do subscribe to my: Weekly Newsletter Just Enough Data