DataHub: With More Data Comes More Responsibility

source: https://github.com/linkedin/datahub

We are in a golden age of data, and the amount of data produced every second is 100 times larger than what we generated maybe a couple of years ago.

I have already talked about the need for data in my previous blog https://ajithshetty28.medium.com/its-am-un-dsen-which-every-data-driven-company-needs-part-1-da07019c4c8f and how Amundsen solved the problem at Lyft.

But let's reiterate a few points.

A new source comes up almost every week. It could be a cloud service like Redshift or Athena, or a product from an independent vendor like Snowflake.

Each of these sources is trying to solve a different set of problems with respect to data.

Now we, as a company, generate different types of data from a wide variety of sources and tools. Using these source systems, we solved the main problem, which is “how to store”.

As the data grows, so does the number of source systems. How do we maintain these datasets and their metadata?

  1. Where should we look when we need a particular set of data?
  2. What is the lineage, and how are other subsets formed from this dataset?
  3. What is the source system, and how do we ingest from it?
  4. And many more.

In the previous blog I explained how “Amundsen” tried to solve this problem.

In this blog we will talk about the next-gen metadata store, LinkedIn's DataHub, which has already been open-sourced.

A fun fact

Did you know this is LinkedIn's third attempt at solving the metadata problem?

  1. WhereHows https://engineering.linkedin.com/blog/2016/03/open-sourcing-wherehows--a-data-discovery-and-lineage-portal
  2. TMS: The Metadata Store (internal to LinkedIn)
  3. Datahub https://engineering.linkedin.com/blog/2019/data-hub

What is Datahub

DataHub is a metadata platform that holds real-time metadata about your different source systems. You can think of it as a wiki for your company's data, where you can find the data you need with a wildcard search or by browsing by source.

DataHub uses Kafka to stream changes so that they are reflected within seconds.

More importantly, it supports both push and pull, and both asynchronous and synchronous, architectures for metadata ingestion and refresh.

Metadata ingestion pipelines can be integrated with Airflow to set up scheduled ingestion or capture lineage.

And in case you are using a new source that DataHub does not support (yet), you may write your own.
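To make the idea of writing your own source a little more concrete, here is a minimal sketch in plain Python. It does not use DataHub's actual plugin classes: `MetadataRecord`, `InventorySource` and `collect` are hypothetical names, used only to illustrate the pull pattern in which a source walks a system and yields one metadata record per dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class MetadataRecord:
    """Hypothetical, simplified metadata record (not DataHub's real model)."""
    platform: str
    name: str
    fields: List[str] = field(default_factory=list)


class InventorySource:
    """Hypothetical pull-based source: walks a catalog and yields records."""

    def __init__(self, platform: str, catalog: Dict[str, List[str]]):
        self.platform = platform
        # catalog maps table name -> column names; a real source would
        # query the system's information schema or API instead.
        self.catalog = catalog

    def get_records(self) -> Iterator[MetadataRecord]:
        # Sort by table name so the output order is deterministic.
        for table, columns in sorted(self.catalog.items()):
            yield MetadataRecord(self.platform, table, list(columns))


def collect(source: InventorySource) -> List[MetadataRecord]:
    """Drain a source into a list, as a run might do before handing off to a sink."""
    return list(source.get_records())


if __name__ == "__main__":
    src = InventorySource(
        "my_new_db",
        {"orders": ["id", "amount"], "users": ["id", "email"]},
    )
    for record in collect(src):
        print(record)
```

A real connector would follow the interfaces documented by the DataHub project; the point here is only that a source boils down to "iterate over a system and emit metadata records".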

High level details

  1. DataHub is built by LinkedIn and open-sourced.
  2. It supports metadata search using Elasticsearch.
  3. It models the relationships between users, datasets and dashboards using Neo4j.
  4. It provides an interface through which users can search for data in the catalog.
  5. It supports LDAP integration.
  6. JDBC connectivity is well supported.

Instead of Elasticsearch you may use Pinot, Druid, etc.

You could possibly use Neo4j, but the DataHub team is backing Elasticsearch as it integrates perfectly with the current APIs.

Architecture

Let's look deeper into the architecture.

1. Ingestion

The ingestion layer is responsible for integrating sources with DataHub so that metadata lands in one central place, irrespective of the sources and their internal architecture.

It currently supports sources such as:

BigQuery, dbt, Druid, Feast, File, Glue, Hive, Kafka Connect, Kafka Metadata, LDAP, Looker dashboards, LookML, MongoDB, Microsoft SQL Server, MySQL, Okta, Oracle, PostgreSQL, Redshift, S3, SageMaker, Snowflake, SQL Profiles, other SQLAlchemy databases and Superset. The supported sinks include Console and DataHub.
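An ingestion run is described by a recipe that names a source and a sink. The sketch below builds such a recipe as a Python dictionary for a MySQL source and the `datahub-rest` sink. The host, database, credentials and the `http://localhost:8080` server address are placeholder assumptions for a local setup, and `build_recipe` and `run_recipe` are hypothetical helpers, not part of DataHub.

```python
import json
from typing import Any, Dict


def build_recipe(host_port: str, database: str, username: str, password: str,
                 server: str = "http://localhost:8080") -> Dict[str, Any]:
    """Return an ingestion recipe: where to read metadata from and where to send it."""
    return {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": host_port,
                "database": database,
                "username": username,
                "password": password,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": server},
        },
    }


def run_recipe(recipe: Dict[str, Any]) -> None:
    """Run the recipe. Needs acryl-datahub installed along with the source plugin."""
    # Imported lazily so that building a recipe works without DataHub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()


if __name__ == "__main__":
    # Placeholder values; replace them with your own connection details.
    recipe = build_recipe("localhost:3306", "sales", "datahub", "datahub")
    print(json.dumps(recipe, indent=2))
```

The same structure is usually written as a YAML file and run with `datahub ingest -c recipe.yml`; the dictionary form is handy when a scheduler needs to generate recipes programmatically.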

2. Storage

The serving layer is responsible for keeping the metadata fresh and exposing it to the front end.

At a high level there are two concepts we need to be aware of.

  1. Storage
  2. Metadata commit log event

Storage: how the metadata is maintained. To persist the metadata you can choose any RDBMS or a NoSQL store like Couchbase.

Metadata commit log event: an event that is sent as soon as the metadata is persisted. Every source sends an event right after the write to storage succeeds. Using this, you can build a system that lets your team know when a new source has been introduced, or when an old source gains a PII field, so that the compliance team can lock that data immediately.
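The notification idea can be sketched with a small event handler. This is an illustration only: a plain Python list stands in for the Kafka topic, `CommitLogEvent` is a hypothetical event shape rather than DataHub's actual schema, and spotting PII by column name is a deliberately naive assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Column-name hints treated as PII in this sketch (an assumption, not a standard).
PII_HINTS = ("email", "phone", "ssn", "address")


@dataclass(frozen=True)
class CommitLogEvent:
    """Hypothetical event emitted after metadata has been persisted."""
    dataset: str
    columns: Tuple[str, ...]
    is_new: bool


def pii_columns(event: CommitLogEvent) -> List[str]:
    """Return the columns whose names contain one of the PII hints."""
    return [c for c in event.columns if any(h in c.lower() for h in PII_HINTS)]


def handle_events(events: Iterable[CommitLogEvent],
                  notify: Callable[[str], None]) -> None:
    """React to each event: announce new datasets and flag PII for compliance."""
    for event in events:
        if event.is_new:
            notify(f"new dataset: {event.dataset}")
        flagged = pii_columns(event)
        if flagged:
            notify(f"PII in {event.dataset}: {', '.join(flagged)}")


if __name__ == "__main__":
    stream = [
        CommitLogEvent("sales.orders", ("id", "amount"), is_new=True),
        CommitLogEvent("crm.users", ("id", "Email", "phone_number"), is_new=False),
    ]
    handle_events(stream, print)
```

In a real deployment `notify` could post to a chat channel or open a ticket for the compliance team, and the events would be read from the stream instead of a list.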

Search index: as soon as the commit log event is triggered, an index-creation API call is made to Elasticsearch, and that index is ultimately used by the front end to serve user searches.
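To see why indexing on every commit keeps search fresh, here is a toy inverted index in plain Python. It only stands in for Elasticsearch: `SearchIndex`, `on_commit` and `search` are hypothetical names, and the wildcard matching uses the standard library's `fnmatchcase` rather than any real search API.

```python
from collections import defaultdict
from fnmatch import fnmatchcase
from typing import Dict, List, Set


class SearchIndex:
    """Toy inverted index mapping a token to the datasets that contain it."""

    def __init__(self) -> None:
        self._tokens: Dict[str, Set[str]] = defaultdict(set)

    def on_commit(self, dataset: str, columns: List[str]) -> None:
        """Index a dataset by the parts of its name and by its column names."""
        name_parts = dataset.lower().replace(".", " ").split()
        for token in name_parts + [c.lower() for c in columns]:
            self._tokens[token].add(dataset)

    def search(self, pattern: str) -> List[str]:
        """Return datasets that have any token matching a wildcard pattern."""
        wanted = pattern.lower()
        hits: Set[str] = set()
        for token, datasets in self._tokens.items():
            if fnmatchcase(token, wanted):
                hits |= datasets
        return sorted(hits)


if __name__ == "__main__":
    index = SearchIndex()
    # Each call plays the role of one commit log event reaching the indexer.
    index.on_commit("sales.orders", ["id", "amount"])
    index.on_commit("crm.users", ["id", "email"])
    print(index.search("ord*"))
    print(index.search("id"))
```

Because the index is updated as part of handling each event, a dataset becomes searchable as soon as its metadata has been persisted.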

3. Frontend

The main idea behind the front end is to give a rich user experience and to put all the information in one place. It is explained in detail in the Demo section.

Highlights

Recent achievement

  • Support for fine-grained access control for metadata operations (read, write, modify)
  • Scope: Access control on entity-level, aspect-level and within aspects as well.
  • This provides the foundation for Tag Governance, Dataset Preview access control etc.
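As a rough illustration of entity-level versus aspect-level access control, consider the sketch below. It is not DataHub's policy model: `Policy` and `is_allowed` are hypothetical, the aspect names are made up, and control within an aspect is left out to keep the example short.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Policy:
    """Hypothetical rule: who may do what, on which entity type and aspect."""
    user: str
    operation: str                # "read", "write" or "modify"
    entity_type: str              # for example "dataset"
    aspect: Optional[str] = None  # None means the rule covers the whole entity


def is_allowed(policies: List[Policy], user: str, operation: str,
               entity_type: str, aspect: str) -> bool:
    """Allow when an entity-level or a matching aspect-level policy exists."""
    for p in policies:
        if (p.user, p.operation, p.entity_type) != (user, operation, entity_type):
            continue
        if p.aspect is None or p.aspect == aspect:
            return True
    return False  # deny by default


if __name__ == "__main__":
    policies = [
        Policy("alice", "read", "dataset"),          # entity-level grant
        Policy("bob", "write", "dataset", "tags"),   # aspect-level grant
    ]
    print(is_allowed(policies, "alice", "read", "dataset", "schema"))
    print(is_allowed(policies, "bob", "write", "dataset", "ownership"))
```

An entity-level rule covers every aspect of that entity type, while an aspect-level rule such as the one for tags is the kind of building block that tag governance would rely on.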

What to look for in the coming quarter

  • Integration with systems like Great Expectations, AWS Deequ, dbt test, etc.
  • Data lake ecosystem integration: Spark Delta Lake, Apache Iceberg, Apache Hudi

Look for the full list here https://datahubproject.io/docs/roadmap/

DEMO

  1. Local installation
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub

Set up Docker with a minimum of 3.8 GB of memory.

datahub docker quickstart

Hit the URL: http://localhost:9002

Since you haven't connected any sources yet, you can save time by using the demo URL https://demo.datahubproject.io/

This includes all the examples and the source connectors for you to play around with.

Screenshot: Example view of the Snowflake data. Here you can look at:

  1. Schema information
  2. Owners
  3. Lineage
  4. Properties, and many more

Screenshot: Dashboard View

Click on Lineage and you can see how the data is generated and from which sources.
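Conceptually, the lineage view is a graph walk. The sketch below is a plain Python illustration under stated assumptions: the `UPSTREAMS` mapping and the dataset names in it are invented for the example, and `all_upstreams` is a hypothetical helper, not a DataHub API.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage: each node maps to the nodes it is derived from.
UPSTREAMS: Dict[str, List[str]] = {
    "dashboard.revenue": ["warehouse.daily_sales"],
    "warehouse.daily_sales": ["raw.orders", "raw.customers"],
    "raw.orders": [],
    "raw.customers": [],
}


def all_upstreams(node: str, upstreams: Dict[str, List[str]]) -> List[str]:
    """Breadth-first walk that returns every direct and indirect upstream once."""
    seen: List[str] = []
    queue = deque(upstreams.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; this also guards against cycles
        seen.append(current)
        queue.extend(upstreams.get(current, []))
    return seen


if __name__ == "__main__":
    print(all_upstreams("dashboard.revenue", UPSTREAMS))
    # prints ['warehouse.daily_sales', 'raw.orders', 'raw.customers']
```

Walking the same mapping in the opposite direction would answer the other lineage question raised earlier: which downstream datasets and dashboards are built from a given source.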

Screenshot: Example lineage for Airflow tasks.

Screenshot: Wildcard Search

Screenshot: You can also look at histograms.

Comparing the different metadata catalog tools would require a separate discussion, but we will do a quick comparison at a very high level.

And from my perspective both Amundsen and Datahub are the best at what they are trying to solve.

Overall I found both DataHub and Amundsen very interesting. Since DataHub is LinkedIn's third attempt, they have learnt a lot from the previous two and made sure to turn their shortcomings into main features.

Having said that, Amundsen has been answering the metadata problem ever since it was open-sourced, and it has evolved by adding the features the community needs.

When it comes to the backend architecture, DataHub is tightly coupled with Kafka so that it can support real-time updates.

With respect to manageability and configuration, DataHub has a bit of an upper hand, as most of its behavior is configurable and it follows a no-code architecture.

Ajith Shetty

BigData Engineer — Love for Bigdata, Analytics, Cloud and Infrastructure.

Want to talk more? Ping me on LinkedIn: https://www.linkedin.com/in/ajshetty28/