Mark the Marquez

source: https://marquezproject.github.io/marquez/

Importance of Metadata

Managing data is often easier than managing its metadata.

Frankly, it would not be wrong to say that metadata is far more important than the actual data.

We can unlock a lot of possibilities with data. But doing that while the metadata keeps changing can be as difficult as building the whole infrastructure.

Why we need metadata

The data we generate today may already be 100 times bigger than what we produced just last year, and it will only grow.

We keep introducing new source systems every day. A source could be a batch system like Redshift or Snowflake, or an IoT feed such as weather, traffic or temperature data.

We can't be certain how this data will change over time.

For example, you introduce a new field. It might not be passed along to the downstream team, and they will never know about it unless it is documented.

Or, to make life difficult for your downstream applications, you change the schema, and all the pipelines that depend on this data start failing.
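To make the schema-change problem concrete, here is a minimal Python sketch that compares two versions of a schema and flags the changes that are likely to break a downstream pipeline. The table, the column names and the rule for what counts as "breaking" are assumptions made up for illustration; this is not how Marquez itself detects changes.

```python
from typing import Dict

Schema = Dict[str, str]  # column name -> type name


def diff_schemas(old: Schema, new: Schema) -> Dict[str, list]:
    """Compare two schema versions and report what changed."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    # Columns present in both versions whose type changed.
    retyped = sorted(
        (name, old[name], new[name])
        for name in set(old) & set(new)
        if old[name] != new[name]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


def is_breaking(diff: Dict[str, list]) -> bool:
    """Assumed rule: removed or retyped columns break consumers, additions do not."""
    return bool(diff["removed"] or diff["retyped"])


# Hypothetical schemas, for illustration only.
orders_v1 = {"order_id": "INTEGER", "amount": "DECIMAL", "placed_at": "TIMESTAMP"}
orders_v2 = {"order_id": "VARCHAR", "amount": "DECIMAL", "currency": "VARCHAR"}

changes = diff_schemas(orders_v1, orders_v2)
print(changes)
print("breaking:", is_breaking(changes))
```

A new column shows up under "added" so the downstream team can at least be told about it, while a removed or retyped column is the kind of change that makes dependent pipelines fail.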

We need strong data lineage that tells us:

  1. Which sources we are connecting to
  2. How we are pulling the data
  3. Who owns the data
  4. How fresh the data is
  5. How clean the data is
  6. and many more.
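One way to picture these questions is as fields on a small metadata record kept next to each dataset. The sketch below is only an illustration: the class, its field names and the sample values are all hypothetical, and the null ratio is just one crude stand-in for "how clean".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DatasetMetadata:
    name: str
    source: str              # 1. which source system the data comes from
    ingestion: str           # 2. how the data is pulled
    owner: str               # 3. who owns the data
    last_updated: datetime   # 4. used to work out freshness
    null_ratio: float        # 5. a crude cleanliness signal (share of null values)

    def is_fresh(self, now: datetime, max_age: timedelta) -> bool:
        """The data counts as fresh if it was updated within max_age."""
        return now - self.last_updated <= max_age


# Hypothetical record, for illustration only.
record = DatasetMetadata(
    name="public.orders",
    source="redshift",
    ingestion="nightly batch",
    owner="orders-team",
    last_updated=datetime(2021, 6, 1, 2, 0),
    null_ratio=0.02,
)

now = datetime(2021, 6, 1, 12, 0)  # ten hours after the last update
print(record.is_fresh(now, timedelta(hours=24)))  # True
print(record.is_fresh(now, timedelta(hours=6)))   # False
```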

With those questions answered, we can build a data-driven team with:

  1. Data you can rely on, with row- and column-level lineage
  2. More and more context that helps your downstream applications understand each change
  3. Self-service for anyone, at any time
  4. Many more possibilities unlocked

Enter Marquez

What is Marquez

Marquez is an open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtime and frequency of dataset access, centralization of dataset lifecycle management, and much more. Marquez was released and open sourced by WeWork.

Source: https://github.com/MarquezProject/marquez

The main goal of Marquez is to collect metadata from the pipeline while it is running.

This gives precise information about the data.

Where does it sit?

Marquez sits between the sources it connects to and the end users who rely on it for self-service, data discovery, data quality and so on.

Core concepts

At the heart of Marquez we have:

Centralised Metadata management

  1. Datasets
  2. Jobs

Modular

  1. Data Health
  2. Data Triggers
  3. Data discovery/Search

Centralised metadata management: this captures the dataset details and the job information, such as the job ID and the state of the data before and after a run.

All of this metadata is captured through a REST API.
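As a rough sketch of what a call to that REST API could look like, the Python below builds (but does not send) a request that registers a dataset. The base URL, the endpoint path and the payload keys are assumptions modelled on the Marquez REST API and should be checked against the API reference for your version; the namespace, source and dataset names are made up.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Assumption: a local Marquez API listening on port 5000.
BASE_URL = "http://localhost:5000/api/v1"


def build_dataset_request(namespace: str, dataset: str, payload: dict) -> Request:
    """Build a PUT request describing a dataset; nothing is sent here."""
    url = f"{BASE_URL}/namespaces/{quote(namespace, safe='')}/datasets/{quote(dataset, safe='')}"
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Assumed payload shape; verify the key names against the API reference.
payload = {
    "type": "DB_TABLE",
    "physicalName": "public.orders",
    "sourceName": "demo-redshift",
    "fields": [
        {"name": "order_id", "type": "INTEGER"},
        {"name": "amount", "type": "DECIMAL"},
    ],
    "description": "One row per customer order",
}

req = build_dataset_request("demo", "public.orders", payload)
print(req.get_method(), req.full_url)
# To actually send it against a running instance:
#   from urllib.request import urlopen
#   urlopen(req)
```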

Modular: since the architecture is modular, Marquez can be used just to check the health and quality of the data, or just to search for datasets with a specific tag.

Data discovery/search: a unified search platform where you maintain the data documentation and tags, and use them to find the dataset you need.
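The idea of searching by documentation and tags can be sketched with a tiny in-memory catalog. Everything here (the catalog contents and the search function) is hypothetical and only illustrates the concept, not Marquez's own search.

```python
from typing import Dict, List, Optional

# Hypothetical catalog: dataset name -> documentation and tags.
catalog: Dict[str, dict] = {
    "public.orders": {"description": "One row per customer order", "tags": {"sales", "pii"}},
    "public.customers": {"description": "Customer master data", "tags": {"pii"}},
    "reports.revenue": {"description": "Daily revenue by region", "tags": {"sales", "finance"}},
}


def search(
    catalog: Dict[str, dict],
    text: Optional[str] = None,
    tag: Optional[str] = None,
) -> List[str]:
    """Return dataset names that match the tag and/or the free-text term."""
    hits = []
    for name, meta in catalog.items():
        if tag is not None and tag not in meta["tags"]:
            continue
        # Free text is matched against the name and the description.
        haystack = f"{name} {meta['description']}".lower()
        if text is not None and text.lower() not in haystack:
            continue
        hits.append(name)
    return sorted(hits)


print(search(catalog, tag="pii"))                    # ['public.customers', 'public.orders']
print(search(catalog, text="revenue"))               # ['reports.revenue']
print(search(catalog, text="customer", tag="sales")) # ['public.orders']
```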

Data health: this helps you answer how accurate your data is, check the schema, or find the source where the data is stored.

Triggers: react to changes in the state of the data without polling. With triggers you can tell how incomplete your data is and reduce the manual handling of backfills.
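"Without polling" means consumers are told when something changes instead of asking repeatedly. A minimal way to sketch that in Python is a publish/subscribe registry; the class, the dataset names and the backfill message below are hypothetical and are not part of Marquez.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Callback = Callable[[str, str], None]  # (dataset, partition) -> None


class DatasetTriggers:
    """Hypothetical registry that notifies subscribers when a dataset changes."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def on_update(self, dataset: str, callback: Callback) -> None:
        self._subscribers[dataset].append(callback)

    def publish(self, dataset: str, partition: str) -> int:
        """Notify every subscriber of the dataset; return how many were called."""
        callbacks = self._subscribers.get(dataset, [])
        for callback in callbacks:
            callback(dataset, partition)
        return len(callbacks)


backfills: List[str] = []
triggers = DatasetTriggers()
# When public.orders gets a new partition, queue a rebuild of the report that uses it.
triggers.on_update(
    "public.orders",
    lambda dataset, partition: backfills.append(f"rebuild reports.revenue for {partition}"),
)

notified = triggers.publish("public.orders", "2021-06-01")
ignored = triggers.publish("public.customers", "2021-06-01")
print(notified, ignored, backfills)
```

The downstream job never has to poll for new data: it runs only when an update is published, which is what takes the manual work out of backfilling.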

Versioning

Marquez has a concept of versioning: every run and every dataset gets a new version, distinct from the previous one.

These versions let you work out the delta between the current run or dataset and the previous one.
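One simple way to picture dataset versioning is to derive a version ID from the dataset's metadata, so that a new version appears only when something actually changes. The sketch below hashes the dataset name and its fields; the hashing scheme and the class are assumptions for illustration, not a description of how Marquez computes its versions.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def version_id(name: str, fields: Dict[str, str]) -> str:
    """Derive a stable ID from the dataset name and its fields."""
    canonical = json.dumps({"name": name, "fields": sorted(fields.items())}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


class VersionHistory:
    """Hypothetical per-dataset history of (version_id, fields) pairs."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.versions: List[Tuple[str, Dict[str, str]]] = []

    def record(self, fields: Dict[str, str]) -> str:
        vid = version_id(self.name, fields)
        # Only append when the metadata differs from the latest version.
        if not self.versions or self.versions[-1][0] != vid:
            self.versions.append((vid, dict(fields)))
        return vid

    def delta(self) -> Dict[str, List[str]]:
        """Columns added or removed between the two most recent versions."""
        if len(self.versions) < 2:
            return {"added": [], "removed": []}
        previous, current = self.versions[-2][1], self.versions[-1][1]
        return {
            "added": sorted(set(current) - set(previous)),
            "removed": sorted(set(previous) - set(current)),
        }


history = VersionHistory("public.orders")
v1 = history.record({"order_id": "INTEGER", "amount": "DECIMAL"})
same = history.record({"amount": "DECIMAL", "order_id": "INTEGER"})  # same fields, other order
v2 = history.record({"order_id": "INTEGER", "amount": "DECIMAL", "currency": "VARCHAR"})
print(len(history.versions), history.delta())
```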

Benefits of Marquez architecture

This architecture helps you with:

  1. Debugging an issue version by version to find the root cause
  2. Seeing which source and destination versions were affected
  3. Failure detection and recovery
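Finding out what is impacted by a bad dataset version comes down to walking the lineage graph downstream. Here is a small breadth-first sketch in Python; the graph, the node naming convention and the function are hypothetical and stand in for whatever lineage the metadata service has collected.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage edges: each key feeds the nodes in its list.
lineage: Dict[str, List[str]] = {
    "dataset:raw.orders": ["job:clean_orders"],
    "job:clean_orders": ["dataset:public.orders"],
    "dataset:public.orders": ["job:daily_revenue", "job:customer_ltv"],
    "job:daily_revenue": ["dataset:reports.revenue"],
    "job:customer_ltv": ["dataset:reports.ltv"],
}


def downstream(graph: Dict[str, List[str]], start: str) -> List[str]:
    """Return every job and dataset reachable from start, nearest first."""
    seen = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


impacted = downstream(lineage, "dataset:public.orders")
print(impacted)
# ['job:daily_revenue', 'job:customer_ltv', 'dataset:reports.revenue', 'dataset:reports.ltv']
```

Reversing the edges and running the same walk gives the upstream view, which is the direction you follow when hunting for a root cause.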

Data model

A source can have one or more datasets.

Each dataset will have one or more versions as jobs progress.

A job will have one or more versions as we run it, and each run produces a new dataset version based on its input.
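The relationships above can be sketched with a few Python dataclasses. The class and attribute names are hypothetical and simplified (versions are plain counters here); they are meant to show the shape of the model, not the actual Marquez schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetVersion:
    dataset: str
    number: int


@dataclass
class Dataset:
    name: str
    versions: List[DatasetVersion] = field(default_factory=list)

    def new_version(self) -> DatasetVersion:
        version = DatasetVersion(self.name, len(self.versions) + 1)
        self.versions.append(version)
        return version


@dataclass
class Source:
    name: str
    datasets: List[Dataset] = field(default_factory=list)  # one or more datasets


@dataclass
class Run:
    job: str
    job_version: int
    outputs: List[DatasetVersion] = field(default_factory=list)


@dataclass
class Job:
    name: str
    version: int = 1
    runs: List[Run] = field(default_factory=list)

    def run(self, outputs: List[Dataset]) -> Run:
        """Each run produces a new version of every output dataset."""
        run = Run(self.name, self.version, [d.new_version() for d in outputs])
        self.runs.append(run)
        return run


warehouse = Source("redshift")
orders = Dataset("public.orders")
warehouse.datasets.append(orders)

job = Job("clean_orders")
first = job.run(outputs=[orders])
job.version += 1  # the job definition changed, so it gets a new version
second = job.run(outputs=[orders])

print([v.number for v in orders.versions])    # [1, 2]
print(first.job_version, second.job_version)  # 1 2
```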

Metadata Collection

Currently the metadata is collected using:

  1. The Marquez API
  2. SDKs for Python and Java

Workflow

Register a Job

A job registration consists of the job version, the input and output configuration details, and the owner and description.

Register Job Run

The job run is registered in Marquez.

Start

Update the status of the job run to STARTED.

Complete

Once the job completes, mark the run as COMPLETED.

Register Job Run Outputs

Record the details of the output locations.
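The five steps above can be sketched as a small in-memory tracker that enforces the order of the state changes. This is a hypothetical stand-in for the metadata service, not the Marquez client: the class, method names and the example output location are made up, and the state names follow the ones used in the steps above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Allowed state changes, using the state names from the steps above.
TRANSITIONS = {"NEW": {"STARTED"}, "STARTED": {"COMPLETED"}, "COMPLETED": set()}


@dataclass
class JobRun:
    run_id: int
    job: str
    state: str = "NEW"
    outputs: List[str] = field(default_factory=list)


class RunTracker:
    """Hypothetical in-memory stand-in for the metadata service."""

    def __init__(self) -> None:
        self.jobs: Dict[str, dict] = {}
        self.runs: Dict[int, JobRun] = {}

    def register_job(self, name: str, inputs: List[str], outputs: List[str],
                     owner: str, description: str = "") -> None:
        self.jobs[name] = {"inputs": inputs, "outputs": outputs,
                           "owner": owner, "description": description}

    def register_run(self, job: str) -> JobRun:
        if job not in self.jobs:
            raise KeyError(f"job {job!r} is not registered")
        run = JobRun(run_id=len(self.runs) + 1, job=job)
        self.runs[run.run_id] = run
        return run

    def _move(self, run: JobRun, state: str) -> None:
        if state not in TRANSITIONS[run.state]:
            raise ValueError(f"cannot go from {run.state} to {state}")
        run.state = state

    def start(self, run: JobRun) -> None:
        self._move(run, "STARTED")

    def complete(self, run: JobRun) -> None:
        self._move(run, "COMPLETED")

    def register_outputs(self, run: JobRun, locations: List[str]) -> None:
        if run.state != "COMPLETED":
            raise ValueError("outputs are recorded after the run completes")
        run.outputs.extend(locations)


tracker = RunTracker()
tracker.register_job("daily_revenue", inputs=["public.orders"],
                     outputs=["reports.revenue"], owner="analytics-team")
run = tracker.register_run("daily_revenue")
tracker.start(run)
tracker.complete(run)
tracker.register_outputs(run, ["s3://example-bucket/reports/revenue/2021-06-01/"])
print(run.state, run.outputs)
```

Keeping the allowed transitions in one table makes an out-of-order call, such as completing a run that never started, fail loudly instead of silently recording bad metadata.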

Demo Time

To try out Marquez locally and quickly, clone the repository:

git clone https://github.com/MarquezProject/marquez

Make sure Docker is running.

Run the start command from the repository root:

./docker/up.sh

Once the required libraries are downloaded and the containers have started, open http://localhost:3000/

Since no data source is connected and no jobs are running yet, you will not see any entries.

But for a reference click on this.


Ajith Shetty

Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.


Interested in getting the weekly updates on the big data analytics around the world, do subscribe to my: Weekly Newsletter Just Enough Data
