Know your Data using OpenMetadata
We are all aware of the importance of metadata, yet we pay little attention to it.
We tend to spend most of our time bringing data in or moving it from one place to another.
But the real questions are:
- How well do we know this data?
- How clean is the data?
- When was it last refreshed?
- Who is maintaining the data?
- Where is the data stored?
- What does each column mean?
Storing and maintaining the data will not, on its own, gain you anything.
We need to understand the data and use it to unlock business value, and to use the data we need to answer all of the questions above.
This is one side of the problem.
Now from a different perspective: different personas have different levels of understanding of the data, and therefore different requirements of it.
- A data scientist looks at all the columns and wants to know what they mean and how they can be used for model building.
- An architect cares about the daily data pipelines and how to refresh and maintain them.
- An analyst needs to understand the data in order to build logic on top of it.
- A CEO wants to see the sales growth or the projection for the next year.
Like these, we have hundreds of personas asking hundreds of questions of the data.
We cannot build a different solution to answer each of their questions.
Rather, we need a single solution for all of them.
Enter OpenMetadata.
Need for a Catalog
In 2021, we generate 100 times more data than we used to produce 10 years ago.
It becomes difficult to keep track of the different areas of the data, such as:
- Data governance, which defines the security around the data: who is allowed to view it and what type of data you are allowed to store.
- Glossary, which defines the full list of data that we maintain.
- Lineage, which defines how the data has been transformed over time.
What is OpenMetadata
OpenMetadata is a single catalog that aggregates metadata from all the sources and presents it to users based on their requirements.
OpenMetadata is an open standard with a centralized metadata store and ingestion framework supporting connectors for a wide range of services. The metadata ingestion framework enables you to customize or add support for any service. REST APIs enable you to integrate OpenMetadata with existing tool chains. Using the OpenMetadata user interface (UI), data consumers can discover the right data to use in decision making and data producers can assess usage and consumer experience in order to plan improvements and prioritize bug fixes.
OpenMetadata enables metadata management end-to-end, giving you the ability to unlock the value of data assets in the common use cases of data discovery and governance, but also in emerging use cases related to data quality, observability, and people collaboration.
You can look up the lineage of the data, from the source where it was generated all the way to where it is consumed.
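To make the idea of lineage concrete, here is a minimal Python sketch that treats lineage as a directed graph and walks it upstream from a dashboard back to its sources. The asset names are made up for illustration, and this is only a toy model, not how OpenMetadata stores or serves lineage.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets it is built from.
UPSTREAM = {
    "sales_dashboard": ["sales_summary"],
    "sales_summary": ["orders_clean", "customers"],
    "orders_clean": ["orders_raw"],
    "orders_raw": [],
    "customers": [],
}


def trace_upstream(asset, upstream=UPSTREAM):
    """Return every asset that feeds into `asset`, nearest first (breadth-first)."""
    seen = []
    queue = deque(upstream.get(asset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(upstream.get(current, []))
    return seen


print(trace_upstream("sales_dashboard"))
# ['sales_summary', 'orders_clean', 'customers', 'orders_raw']
```

Reversing the edges in the same structure would give the downstream view, that is, everything that consumes a given table.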
You can define the owner and add descriptions to the data, which helps you establish its authenticity and find the owner's details whenever needed.
Profiling the data
You can view the details of each column and check the counts of null and non-null values, which helps you build trust in the data.
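As a rough illustration of what a null / non-null count per column means, here is a small Python sketch over a handful of in-memory rows. The rows and column names are hypothetical, and this is not OpenMetadata's profiler, just the underlying idea.

```python
def profile_columns(rows):
    """Count null and non-null values per column for a list of dict rows."""
    profile = {}
    for row in rows:
        for column, value in row.items():
            counts = profile.setdefault(column, {"null": 0, "non_null": 0})
            if value is None:
                counts["null"] += 1
            else:
                counts["non_null"] += 1
    return profile


# Hypothetical sample rows; None stands in for a SQL NULL.
sample_rows = [
    {"order_id": 1, "customer": "alice", "discount": None},
    {"order_id": 2, "customer": None, "discount": 0.1},
    {"order_id": 3, "customer": "carol", "discount": None},
]

for column, counts in profile_columns(sample_rows).items():
    print(column, counts)
```

A column such as `discount` that is mostly null is a hint to ask the owner whether the field is optional or whether the pipeline is dropping values.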
Multiple connectors out of the box
You can integrate:
- Databases like Oracle, MySQL, Snowflake, Hive and many more.
- Data models like dbt.
- Dashboards like Tableau, Superset, etc.
- Messaging services like Kafka and Pulsar (WIP).
- Pipelines like Airflow, Prefect, etc.
Full list of connectors: https://docs.open-metadata.org/connectors
You can view the different versions of the data as time passes, see the changes in the data, columns or owners, and backtrack.
Wild card search
Search for a term like table or catalog. Boolean operators are supported as well.
Prerequisites
- Python 3.8.0 or higher
- pip 19.2.3 or higher
- Docker 20.10.0 or higher
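Before going further, it can help to confirm that the minimum versions listed above are met. The following Python sketch is one possible way to check; it assumes that `pip` is available as a module of the current interpreter and that `docker` is on your `PATH`, and the helper names are my own.

```python
import re
import shutil
import subprocess
import sys

# Minimum versions taken from the list above.
MINIMUMS = {"python": (3, 8, 0), "pip": (19, 2, 3), "docker": (20, 10, 0)}


def parse_version(text):
    """Extract the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    return tuple(int(part or 0) for part in match.groups())


def installed_version(command):
    """Run `<command> --version` and parse the output; None if not installed."""
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(command + ["--version"], capture_output=True, text=True)
    return parse_version(result.stdout or result.stderr)


def check_prerequisites():
    """Return {tool: (found_version, meets_minimum)} for each prerequisite."""
    found = {
        "python": tuple(sys.version_info[:3]),
        "pip": installed_version([sys.executable, "-m", "pip"]),
        "docker": installed_version(["docker"]),
    }
    return {
        name: (version, version is not None and version >= MINIMUMS[name])
        for name, version in found.items()
    }


if __name__ == "__main__":
    for name, (version, ok) in check_prerequisites().items():
        print(f"{name}: found {version}, {'OK' if ok else 'needs attention'}")
```

Versions are compared as integer tuples rather than strings, so that 20.10.0 correctly ranks above 20.9.0.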
Set up the directory
mkdir openmetadata-docker; cd openmetadata-docker
Create a Python virtual environment
python3 -m venv env
Activate the virtual environment
source env/bin/activate
Install OpenMetadata with pip
pip3 install 'openmetadata-ingestion[docker]'
Start the OpenMetadata Docker containers
metadata docker --start
Hit the below URL for the ingestion job in Airflow
Finally, hit the below url for OpenMetadata with sample data.
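The containers can take a while to come up, so the pages may not load on the first try. Below is a small Python sketch that polls a URL until it answers. The two URLs are assumptions based on the usual local defaults (port 8585 for the OpenMetadata UI and port 8080 for Airflow); adjust them if your setup differs. The function names are my own.

```python
import time
import urllib.error
import urllib.request

# Assumed local defaults; change these to match your setup.
OPENMETADATA_URL = "http://localhost:8585"
AIRFLOW_URL = "http://localhost:8080"


def is_up(url, timeout=5):
    """Return True if the URL answers with any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server responded, even if with an error status
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, DNS failure, etc.


def wait_until_up(url, attempts=30, delay=10, probe=is_up, sleep=time.sleep):
    """Poll url until probe succeeds; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None


if __name__ == "__main__":
    for name, url in [("Airflow", AIRFLOW_URL), ("OpenMetadata", OPENMETADATA_URL)]:
        attempt = wait_until_up(url)
        status = f"up after {attempt} attempt(s)" if attempt else "not reachable"
        print(f"{name} at {url}: {status}")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server; in normal use you can leave them at their defaults.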
Search for the table
Search for the topics
Search for the dashboards
Search for the pipelines
Column level profiling
GitHub - open-metadata/OpenMetadata: Open Standard for Metadata. A Single place to Discover…
OpenMetadata is an Open Standard for Metadata. A Single place to Discover, Collaborate, and Get your data right…
Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.