Mark the Marquez

Ajith Shetty
5 min read · Oct 3, 2021
source: https://marquezproject.github.io/marquez/

Importance of Metadata

Managing data is often easier than managing its metadata.

Frankly speaking, it would not be wrong to say that metadata is far more important than the actual data.

We can unlock a lot of possibilities with data. But keeping up with its never-ending metadata changes is no less difficult than building the whole infrastructure.

Why we need metadata

The data we generate today is already many times bigger than what we produced even a year ago. And it will only grow.

We keep introducing new source systems every day. They could be batch sources such as Redshift or Snowflake, or IoT feeds such as weather, traffic, or temperature sensors.

Now, we can't be certain how this data changes over time.

For example, you introduce a new field that is never communicated to the downstream team, and they will never know unless it is specified.

Or, to make life difficult for your downstream applications, you change the schema, and all the pipelines that depend on this data start failing.

We need strong lineage for our data, answering:

  1. What sources are we connecting to?
  2. How are we pulling the data?
  3. Who are the owners of the data?
  4. How fresh is the data?
  5. How clean is the data?
  6. and many more.

With these questions answered, we can build a data-driven team with:

  1. Data you can rely on, with row- and column-level lineage
  2. More and more context for your downstream applications to understand changes
  3. Self-service for anyone, at any time
  4. Unlocked possibilities, and many more

Enter Marquez

What is Marquez?

Marquez is an open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtime and frequency of dataset access, centralization of dataset lifecycle management, and much more. Marquez was released and open sourced by WeWork.

Source: https://github.com/MarquezProject/marquez

The main goal of Marquez is to collect metadata from pipelines as they run.

This gives precise information about the data at the moment it is processed.

Where does it sit?

Marquez sits between the sources it connects to and the end users who rely on it for self-service, data discovery, data quality, and more.

Core concepts

At the heart of Marquez we have:

Centralised Metadata management

  1. Datasets
  2. Jobs

Modular

  1. Data Health
  2. Data Triggers
  3. Data discovery/search

Centralised metadata management: capturing dataset details and job information, such as the job ID and the state of the data before and after each run.

All this metadata is captured through a REST API.
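As a rough sketch of what that capture looks like, here is a minimal Python example that registers a source and a dataset against a local Marquez instance over its REST API. The names (my-namespace, my-source, my-dataset, data-team) are made up for illustration, and the endpoint paths follow the Marquez API reference, so verify them against your version:

import requests

MARQUEZ_URL = "http://localhost:5000/api/v1"  # default API port in the Docker setup

# Create (or update) a namespace to hold our metadata.
requests.put(
    f"{MARQUEZ_URL}/namespaces/my-namespace",
    json={"ownerName": "data-team", "description": "Example namespace"},
)

# Register the source the dataset physically lives in.
requests.put(
    f"{MARQUEZ_URL}/sources/my-source",
    json={
        "type": "POSTGRESQL",
        "connectionUrl": "jdbc:postgresql://db:5432/example",
    },
)

# Register the dataset itself, including its schema fields, so downstream
# consumers can discover it and track schema changes over time.
requests.put(
    f"{MARQUEZ_URL}/namespaces/my-namespace/datasets/my-dataset",
    json={
        "type": "DB_TABLE",
        "physicalName": "public.my_table",
        "sourceName": "my-source",
        "fields": [{"name": "id", "type": "INTEGER"}],
        "description": "Example dataset registered over the REST API",
    },
)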

Modular: since the architecture is modular, Marquez can be used specifically to check the health and quality of the data, or to search for data by a specific tag.

Data discovery/search: a unified search platform where you maintain data documentation and tags, and through which you can search for the dataset you need.

Data health: this helps you answer how accurate your data is, check the schema, or find the source where the data is stored.

Triggers: reacting to changes in data state without polling. With triggers you can answer how incomplete your data is and reduce the manual handling of backfills.

Versioning

Marquez has a concept of versioning: for every run and every dataset, it maintains a new version distinct from the previous one.

These versions let you compute the delta between the previous run or dataset version and the current one.
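As an illustration, here is a hedged sketch of using that version history against the example dataset above. The dataset-versions endpoint path is taken from the Marquez API reference, but the response shape and ordering (most recent first) are assumptions to verify against your version:

import requests

MARQUEZ_URL = "http://localhost:5000/api/v1"

# Fetch the version history for the example dataset.
resp = requests.get(
    f"{MARQUEZ_URL}/namespaces/my-namespace/datasets/my-dataset/versions"
)
versions = resp.json().get("versions", [])

# Compare the field names of the two most recent versions to spot schema drift.
if len(versions) >= 2:
    current = {f["name"] for f in versions[0].get("fields", [])}
    previous = {f["name"] for f in versions[1].get("fields", [])}
    print("added fields:", sorted(current - previous))
    print("removed fields:", sorted(previous - current))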

Benefits of Marquez architecture

This unique Marquez architecture helps you with:

  1. Debugging issues per version to find the root cause
  2. Identifying which source/destination versions are affected and impacted
  3. Failure detection and recovery

Data model

A source can have one or more datasets.

Each dataset will have one or more versions as jobs progress.

Jobs likewise have one or more versions as we run them, and each run produces different dataset versions based on its inputs.
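To make these one-to-many relationships concrete, here is a small illustrative sketch in plain Python. This is not Marquez code, just a toy model of the relationships described above:

from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetVersion:
    version_id: str  # one new version per producing run

@dataclass
class Dataset:
    name: str
    versions: List[DatasetVersion] = field(default_factory=list)

@dataclass
class Source:
    name: str
    datasets: List[Dataset] = field(default_factory=list)  # a source has 1+ datasets

@dataclass
class Run:
    run_id: str
    outputs: List[DatasetVersion] = field(default_factory=list)  # each run yields new dataset versions

@dataclass
class Job:
    name: str
    versions: List[str] = field(default_factory=list)  # a job gains versions as it changes
    runs: List[Run] = field(default_factory=list)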

Metadata Collection

Currently, metadata is collected using:

  1. The Marquez REST API
  2. SDKs for Python and Java

Workflow

Register a Job

This consists of the job version, input/output configuration details, and owner/description.

Register Job Run

A run for the job is registered in Marquez.

Start

Update the status of the run to STARTED.

Complete

Once the run completes, mark it as COMPLETED.

Register Job Run Outputs

Record the details of the output locations.
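Putting these steps together, here is a minimal sketch of the run lifecycle over the REST API. The names continue the earlier made-up example, and the endpoint paths follow the Marquez API reference (there are also /fail and /abort transitions), so verify them against your version:

import requests

MARQUEZ_URL = "http://localhost:5000/api/v1"
NAMESPACE, JOB = "my-namespace", "my-job"

# 1. Register the job with its inputs, outputs, and description.
requests.put(
    f"{MARQUEZ_URL}/namespaces/{NAMESPACE}/jobs/{JOB}",
    json={
        "type": "BATCH",
        "inputs": [{"namespace": NAMESPACE, "name": "my-dataset"}],
        "outputs": [{"namespace": NAMESPACE, "name": "my-output-dataset"}],
        "description": "Example job registered over the REST API",
    },
)

# 2. Register a run for the job; Marquez returns the run with its id.
run = requests.post(
    f"{MARQUEZ_URL}/namespaces/{NAMESPACE}/jobs/{JOB}/runs", json={}
).json()
run_id = run["id"]

# 3. Mark the run as STARTED.
requests.post(f"{MARQUEZ_URL}/jobs/runs/{run_id}/start")

# ... the pipeline does its actual work here ...

# 4. Mark the run as COMPLETED. Output dataset versions get associated
#    with the run, which is how each run yields new dataset versions.
requests.post(f"{MARQUEZ_URL}/jobs/runs/{run_id}/complete")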

Demo Time

To test out Marquez locally and quickly:

git clone https://github.com/MarquezProject/marquez

Make sure Docker is running, then run the start command:

./docker/up.sh

Once the required images are downloaded and the containers are up, you can open http://localhost:3000/.

Since we have not connected any data sources or run any jobs yet, you will not see any entries.
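Even with an empty UI, you can sanity-check that the backend is up by hitting the API directly (the API defaults to port 5000 in this Docker setup, while the web UI is the port 3000 opened above):

import requests

# On a fresh install this lists only the default namespaces,
# with no user-created jobs or datasets yet.
resp = requests.get("http://localhost:5000/api/v1/namespaces")
print(resp.json())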



Ajith Shetty

Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.

Subscribe✉️ ||More blogs📝||Linked In📊||Profile Page📚||Git Repo👓

Interested in weekly updates on big data analytics around the world? Subscribe to my newsletter: Just Enough Data
