Know your Data using OpenMetadata

Source: https://docs.open-metadata.org/

We are all aware of the importance of metadata, yet we rarely give it much care.

We tend to spend most of our time bringing data in or moving it from one place to another.

But the real questions are:

  1. How well do we know this data?
  2. How clean is the data?
  3. When was it last refreshed?
  4. Who is maintaining the data?
  5. Where is the data stored?
  6. What does each column mean?

Storing and maintaining the data alone will not gain you anything.

We need to understand the data and use it to unlock business value. And to use the data, we need to answer all the questions above.

This is one set of problems.

Now, from a different perspective: we have different personas with different levels of understanding of the data, and therefore different requirements of it.

Say,

A data scientist would look at all the columns, wanting to know what they mean and how they can be used for model building.

An architect would care about the daily data pipelines, and how to refresh and maintain them.

Analysts would need to understand the data in order to build business logic on top of it.

A CEO would like to see the sales growth or the projection for the next year.

Like the above, we have hundreds of personas asking hundreds of questions of the data.

Now, to answer their questions, we cannot build a different solution for each of them.

Rather, we need a single solution that answers all the questions.

Enter Open Metadata.

Need for Catalog

In 2021, we generate orders of magnitude more data than we did 10 years back.

It becomes difficult to keep track of the different areas of the data, such as:

Data governance, which defines security over the data: who is allowed to view it, and what type of data you are allowed to store.

Glossary, which defines the full vocabulary of the data we are maintaining.

Lineage, which defines how the data has been transformed over time.

What is OpenMetadata

OpenMetadata is a single catalog that aggregates metadata from all your sources and presents it to users based on their requirements.

OpenMetadata is an open standard with a centralized metadata store and ingestion framework supporting connectors for a wide range of services. The metadata ingestion framework enables you to customize or add support for any service. REST APIs enable you to integrate OpenMetadata with existing tool chains. Using the OpenMetadata user interface (UI), data consumers can discover the right data to use in decision making and data producers can assess usage and consumer experience in order to plan improvements and prioritize bug fixes.

OpenMetadata enables metadata management end-to-end, giving you the ability to unlock the value of data assets in the common use cases of data discovery and governance, but also in emerging use cases related to data quality, observability, and people collaboration.

Source: https://docs.open-metadata.org/
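As a sketch of what that REST integration can look like, the snippet below builds the kind of requests one might send to a locally running OpenMetadata server. The `localhost:8585` host and the `/api/v1/tables` path are assumptions based on the default sandbox setup described later in this post; check the API docs for your deployment before relying on them.

```python
from urllib.parse import quote, urljoin
from urllib.request import Request

# Assumed default for the local sandbox; adjust for your deployment.
BASE_URL = "http://localhost:8585/api/"

def table_endpoint(fqn: str) -> str:
    """Build the URL for fetching a table entity by its fully
    qualified name (service.database.schema.table)."""
    return urljoin(BASE_URL, "v1/tables/name/" + quote(fqn, safe="."))

def list_tables_request(limit: int = 10) -> Request:
    """Prepare (but do not send) a GET request listing tables."""
    url = urljoin(BASE_URL, f"v1/tables?limit={limit}")
    return Request(url, headers={"Accept": "application/json"})

req = list_tables_request(limit=5)
print(req.full_url)
print(table_endpoint("mysql.default.shop.orders"))
```

Sending these with `urllib.request.urlopen` (or any HTTP client) against a running server returns JSON entity payloads that you can feed into your own tool chain.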

Features

Lineage

You can trace the data lineage from the source where the data is generated all the way to where it is consumed.

Add descriptions

You can define the owner and add descriptions to the data, which will help you identify its authenticity and the owner details whenever needed.

Profiling the data

You can view the details for each column and check the null and non-null value counts, which helps you build trust in the data.
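At its simplest, this kind of profiling is just a per-column tally of null versus non-null values. The toy sketch below is not OpenMetadata's implementation, only an illustration of the metric it surfaces:

```python
# Toy illustration of column profiling: per-column null vs. non-null
# counts over rows represented as dicts. Not OpenMetadata's code,
# just the kind of statistic its profiler reports.
def profile(rows, columns):
    stats = {c: {"null": 0, "non_null": 0} for c in columns}
    for row in rows:
        for c in columns:
            if row.get(c) is None:
                stats[c]["null"] += 1
            else:
                stats[c]["non_null"] += 1
    return stats

rows = [
    {"id": 1, "email": "a@x.io"},
    {"id": 2, "email": None},
    {"id": 3},                      # missing email counts as null
]
stats = profile(rows, ["id", "email"])
print(stats)
```

Seeing that, say, two out of three `email` values are null tells you at a glance how much to trust that column.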

Multi connector out of the box

You can integrate databases like Oracle, MySQL, Snowflake, Hive and many more.

Data models like dbt.

Dashboards like Tableau, Superset, etc.

Messaging services like Kafka and Pulsar (WIP).

Pipelines like Airflow, Prefect, etc.

Full list of connectors: https://docs.open-metadata.org/connectors
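To give a feel for how a connector is wired up, here is a minimal sketch of an ingestion workflow definition for the MySQL connector. The field names follow the general shape of OpenMetadata ingestion configs, but they vary between versions, and the hostnames and credentials are placeholders, so treat this as a template rather than a working file:

```yaml
# Sketch of a MySQL metadata-ingestion workflow.
# Field names may differ by OpenMetadata version;
# credentials and hosts below are placeholders.
source:
  type: mysql
  serviceName: local_mysql
  serviceConnection:
    config:
      type: Mysql
      username: openmetadata_user
      password: changeme
      hostPort: localhost:3306
  sourceConfig:
    config:
      type: DatabaseMetadata
sink:
  type: metadata-rest        # push extracted metadata to the server API
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: no-auth
```

A config like this is typically run through the ingestion CLI (e.g. `metadata ingest -c mysql.yaml`); consult the connector page in the docs for the exact fields your version expects.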

Metadata versioning

You can view the different versions of the data as time passes, see the changes in the data, columns or owners, and backtrack through them.

Wild card search

Search for a term such as a table or catalog name. Boolean operators are supported as well.
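Continuing the REST sketch from earlier, a search query with boolean operators can be composed as a simple URL. The `/api/v1/search/query` path and the `table_search_index` index name are assumptions about the sandbox deployment, so verify them against your server's API docs:

```python
from urllib.parse import urlencode

# Assumed search endpoint for a local sandbox; the exact path and
# index names may differ between OpenMetadata versions.
SEARCH_PATH = "http://localhost:8585/api/v1/search/query"

def search_url(query: str, index: str = "table_search_index") -> str:
    """Compose a search URL; the query string supports wildcards
    (e.g. 'fact_*') and boolean operators (AND, OR, NOT)."""
    return SEARCH_PATH + "?" + urlencode({"q": query, "index": index})

url = search_url("orders AND NOT deleted")
print(url)
```

The same query string you would type into the UI search box goes into `q` here, so wildcards and boolean operators carry over unchanged.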

Demo

Prerequisites:

  1. Python 3.8.0 or higher
  2. pip 19.2.3 or higher
  3. Docker 20.10.0 or higher

Set up the directory

mkdir openmetadata-docker; cd openmetadata-docker

Create a Python environment

python3 -m venv env

Activate the virtual environment

source env/bin/activate

Pip install OpenMetadata

pip3 install 'openmetadata-ingestion[docker]'

Start the OpenMetadata Docker containers

metadata docker --start

Open the URL below for the ingestion jobs in Airflow:

http://localhost:8080

Username: admin

Password: admin

Finally, open the URL below for OpenMetadata, loaded with sample data:

http://localhost:8585

Search for tables

Search for topics

Search for dashboards

Search for pipelines

Schema information

Column level profiling

Table references


Ajith Shetty

Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.

