We are in a golden age of data: the amount of data produced every second is 100 times larger than what we generated just a couple of years ago.
I have already talked about the need for data in my previous blog https://ajithshetty28.medium.com/its-am-un-dsen-which-every-data-driven-company-needs-part-1-da07019c4c8f and how Amundsen solved the problem at Lyft.
But let's reiterate a few points.
We have a new source coming up almost every week. It could be a cloud service like Redshift or Athena, or a product from a private company like Snowflake.
Each of these sources tries to solve a different set of problems with respect to the data.
Now, as a company, we generate different types of data from a wide variety of sources and tools. Using these source systems we solved the main problem, which is "how to store".
But as the data grows, and the source systems grow with it, how do we maintain these datasets and their metadata?
- Where should we look when we need a particular set of data?
- What is the lineage, and how are other subsets derived from this dataset?
- What is the source system, and how do we ingest from it?
- And many more.
In the previous blog I explained how Amundsen tried to solve this problem.
In this blog we will talk about the next-gen metadata store: LinkedIn's DataHub, which has already been open-sourced.
A fun fact
Did you know it is LinkedIn's third attempt at solving the metadata problem?
- WhereHows https://engineering.linkedin.com/blog/2016/03/open-sourcing-wherehows--a-data-discovery-and-lineage-portal
- TMS: The Metadata Store (internal to LinkedIn)
- Datahub https://engineering.linkedin.com/blog/2019/data-hub
What is DataHub
DataHub is a metadata platform that holds real-time metadata about your different source systems. You can think of it as a wiki for your company, where you can search for the data you need with a wildcard search or by source.
DataHub uses a Kafka stream to get changes reflected within seconds.
More importantly, it supports push and pull, asynchronous and synchronous architectures for metadata ingestion and refresh.
Metadata ingestion pipelines can be integrated with Airflow to set up scheduled ingestion or capture lineage.
And in case you are using a new source which is not supported by DataHub (yet), you may write your own.
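To illustrate the push model, here is a minimal sketch in plain Python of a custom source that pushes its metadata to a central store. Note this is a simplified simulation: the class names, the `MetadataEvent` shape and the URN format are all hypothetical, not DataHub's actual Source API.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class MetadataEvent:
    """A simplified metadata change event (hypothetical schema)."""
    entity: str                 # e.g. "dataset"
    urn: str                    # unique identifier of the entity
    aspects: Dict[str, object]  # metadata aspects, e.g. schema, ownership

class MetadataStore:
    """Stands in for the central metadata store fed by push-based sources."""
    def __init__(self) -> None:
        self.events: List[MetadataEvent] = []

    def push(self, event: MetadataEvent) -> None:
        self.events.append(event)

class MyCustomSource:
    """A custom source that pushes metadata about its datasets to the store."""
    def __init__(self, store: MetadataStore) -> None:
        self.store = store

    def ingest(self) -> None:
        # A real source would scan the system's own catalog here.
        for table in ("orders", "customers"):
            self.store.push(MetadataEvent(
                entity="dataset",
                urn=f"urn:example:mydb.{table}",
                aspects={"owner": "data-team"},
            ))

store = MetadataStore()
MyCustomSource(store).ingest()
print([e.urn for e in store.events])
```

In a pull architecture the same `ingest()` logic would instead be invoked on a schedule (for example by Airflow, as mentioned above) rather than by the source itself.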
High level details
- DataHub was built by LinkedIn and open-sourced.
- It supports metadata search using Elasticsearch.
- It models the relationships between users, datasets and dashboards using Neo4j.
- It provides an interface through which users can search for data in the catalog.
- It supports LDAP integration.
- JDBC connectivity is well supported.
Instead of Elasticsearch you may use alternatives such as Pinot or Druid. You could possibly use Neo4j here as well, but the DataHub team is backing Elasticsearch as it integrates cleanly with the current APIs.
Let's look deep inside the architecture
The ingestion layer is responsible for how the sources are integrated with DataHub; it ingests metadata into a central place irrespective of the source and its internal architecture.
It currently supports many different sources, such as:
BigQuery, dbt, Druid, Feast, File, Glue, Hive, Kafka Connect, Kafka Metadata, LDAP, Looker dashboards, LookML, MongoDB, Microsoft SQL Server, MySQL, Okta, Oracle, PostgreSQL, Redshift, S3, SageMaker, Snowflake, SQL Profiles, other SQLAlchemy databases and Superset, plus sinks such as Console and DataHub itself.
The serving layer is responsible for maintaining the freshness of the data and exposing it to the front end.
At a high level there are three concepts we need to be aware of: storage, the metadata commit-log event, and the search index.
Storage: how the metadata is maintained. To persist the data or metadata, you can choose any RDBMS, or a NoSQL store like Couchbase.
Metadata commit-log event: an event is sent as soon as the metadata gets persisted. Every source sends an event right after the write to storage succeeds. Using this, you can build a system to let your team know that a new source has been introduced, or that an old source now has a PII field, so the compliance team can lock that data down immediately.
Search index: as soon as the commit-log event is triggered, an index-creation API call is made to Elasticsearch, which will ultimately be used by the front end to serve user searches.
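The flow above (persist, emit an event, fan out to consumers such as the search indexer or a compliance check) can be sketched as a simplified simulation. The record shape and field names below are hypothetical, not DataHub's actual event schema:

```python
from typing import Callable, Dict, List

class CommitLog:
    """Simplified commit log: every persisted record notifies all subscribers."""
    def __init__(self) -> None:
        self.storage: List[Dict] = []  # stands in for the RDBMS/NoSQL store
        self.subscribers: List[Callable[[Dict], None]] = []

    def persist(self, record: Dict) -> None:
        self.storage.append(record)      # 1. persist the metadata
        for notify in self.subscribers:  # 2. emit the commit-log event
            notify(record)

search_index: List[str] = []
compliance_alerts: List[str] = []

def index_for_search(record: Dict) -> None:
    # Stands in for the index-creation call to Elasticsearch.
    search_index.append(record["name"])

def check_pii(record: Dict) -> None:
    # Lets the compliance team react immediately to new PII fields.
    if any(f in record.get("fields", []) for f in ("ssn", "email")):
        compliance_alerts.append(record["name"])

log = CommitLog()
log.subscribers += [index_for_search, check_pii]
log.persist({"name": "users", "fields": ["id", "email"]})
log.persist({"name": "clicks", "fields": ["url"]})
```

In DataHub the event bus between the commit log and the consumers is Kafka, which is what makes the updates near real time.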
The main idea behind the front end is to provide a rich user experience and to place all the information in one place. This is explained in detail in the demo section.
- Support for fine-grained access control for metadata operations (read, write, modify)
- Scope: Access control on entity-level, aspect-level and within aspects as well.
- This provides the foundation for Tag Governance, Dataset Preview access control etc.
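As a rough illustration of entity-level versus aspect-level access control, here is a tiny policy model. It is hypothetical (DataHub's actual policy engine looks different); it only shows how an aspect-level rule is narrower than an entity-level one:

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Policy:
    role: str
    action: str        # "read" | "write" | "modify"
    entity: str        # e.g. "dataset"
    aspect: str = "*"  # "*" covers the whole entity; otherwise one aspect

POLICIES: List[Policy] = [
    Policy("steward", "modify", "dataset", "tags"),  # aspect-level rule
    Policy("analyst", "read", "dataset"),            # entity-level rule
]

def is_allowed(role: str, action: str, entity: str, aspect: str) -> bool:
    """Grant access only if some policy matches role, action, entity and aspect."""
    return any(
        p.role == role and p.action == action and p.entity == entity
        and p.aspect in ("*", aspect)
        for p in POLICIES
    )

print(is_allowed("steward", "modify", "dataset", "tags"))   # tag governance
print(is_allowed("analyst", "modify", "dataset", "tags"))
```

The steward can modify only the `tags` aspect (the foundation for tag governance), while the analyst's entity-level rule grants read access to every aspect but nothing more.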
What to look for in the coming quarter
- Integration with systems like Great Expectations, AWS Deequ, dbt test, etc.
- Data lake ecosystem integration: Spark Delta Lake, Apache Iceberg, Apache Hudi
See the full list here: https://datahubproject.io/docs/roadmap/
- Local installation
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
Set up Docker with a minimum of 3.8 GB of memory.
datahub docker quickstart
Hit the URL: http://localhost:9002
Now, since you haven't connected any sources yet, to save time you may hit the demo URL https://demo.datahubproject.io/
It includes examples and source connectors for you to play around with.
Screenshot: Example view of the Snowflake data. Here you can look at:
- the schema information
- the owners
- properties, and many more
Screenshot: Dashboard View
Click on the Lineage tab and you can see how the data is generated and from which sources.
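Under the hood, a lineage view like this boils down to a walk over a graph of upstream edges. A minimal sketch, using made-up dataset names rather than the demo's actual data:

```python
from typing import Dict, List

# Map each dataset to the upstream datasets it is derived from (hypothetical names).
UPSTREAMS: Dict[str, List[str]] = {
    "daily_revenue_dashboard": ["daily_revenue"],
    "daily_revenue": ["orders", "payments"],
    "orders": [],
    "payments": [],
}

def lineage(dataset: str) -> List[str]:
    """Return all transitive upstream sources of a dataset."""
    seen: List[str] = []
    stack = [dataset]
    while stack:
        for up in UPSTREAMS.get(stack.pop(), []):
            if up not in seen:
                seen.append(up)
                stack.append(up)
    return seen

print(lineage("daily_revenue_dashboard"))
```

DataHub stores these relationships in Neo4j, so the UI can answer both "where did this come from" and "what breaks downstream if this changes".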
Screenshot: Example lineage for Airflow tasks.
Screenshot: Wildcard Search
Screenshot: You can look for histogram
Comparing the different metadata catalog tools in depth would require a separate discussion, but let's do a quick comparison at a very high level.
From my perspective, both Amundsen and DataHub are the best at what they are trying to solve.
Overall I found both DataHub and Amundsen very interesting. Since DataHub is LinkedIn's third attempt, they have learnt a lot from the previous two and made sure to turn the earlier shortcomings into main features.
Having said that, Amundsen has managed to answer the metadata problem ever since it was open-sourced, and it has evolved and added the features the community needs.
When it comes to the backend architecture, DataHub is tightly coupled with Kafka so that it can support real-time updates.
With respect to manageability and configuration, DataHub has a bit of an upper hand, as most of its behavior is configurable and it follows a no-code architecture.
Further reading:
- DataHub Quickstart Guide | DataHub
- DataHub: Popular metadata architectures explained
BigData Engineer — Love for Bigdata, Analytics, Cloud and Infrastructure.
Subscribe to my: Weekly Newsletter Just Enough Data