Mark the Marquez


The Importance of Metadata

Managing the data is far easier than managing its metadata.

Frankly speaking, it would not be wrong to say that metadata is far more important than the actual data.

We can unlock a lot of possibilities with data. But keeping up with never-ending metadata changes is no less difficult than building the whole infrastructure.

Why we need metadata

The data we generate today is already many times bigger than what we produced even last year. And it will only grow.

We keep introducing new source systems every day. It could be a batch source like Redshift or Snowflake, or an IoT feed like weather, traffic, or temperature sensors.

Now, we can't be certain how this data changes over time.

For example: you introduce a new field. It might not have been communicated to the downstream team, and they will never know unless it is specified somewhere.

Or, to make your downstream application's life difficult, you change the schema, and all the pipelines dependent on this data start failing.

We need strong lineage of the data, answering:

  1. What sources are we connecting to?
  2. How are we pulling the data?
  3. Who are the owners of the data?
  4. How fresh is the data?
  5. How clean is the data?
  6. and many more.

With the above questions answered, we can build a data-driven team with:

  1. Data you can rely on, with row- and column-level lineage
  2. More and more context for your downstream applications to understand changes
  3. Self-service for anyone, at any time
  4. Unlocked possibilities, and many more

Enter Marquez

What is Marquez

Marquez is an open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtime and frequency of dataset access, centralization of dataset lifecycle management, and much more. Marquez was released and open sourced by WeWork.


The main goal of Marquez is to collect metadata from the pipeline as it is running.

This gives precise information about the data.

Where does it sit?

Marquez sits between the sources it connects to and the end users, enabling self-service, data discovery, data quality checks, and more.

Core concepts

At the heart of Marquez we have:

Centralised Metadata management

  1. Datasets
  2. Jobs


And, as modular components:

  1. Data health
  2. Data triggers
  3. Data discovery/search

Centralised metadata management: capturing dataset details and job information, such as the job ID and the state of the data before and after each run.

All of this metadata is captured through a REST API.
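As a sketch of what that capture looks like, here is a minimal Python example that registers a namespace against a local Marquez instance. The URL, namespace name, and owner below are illustrative placeholders, and the endpoint shape follows my reading of the Marquez v1 API, so verify against the docs:

```python
import json
import urllib.request

# Assumed local Marquez instance; the API listens on port 5000 by default.
MARQUEZ_URL = "http://localhost:5000/api/v1"

def namespace_payload(owner: str, description: str) -> dict:
    """Build the JSON body for registering a namespace."""
    return {"ownerName": owner, "description": description}

def put(path: str, payload: dict) -> int:
    """PUT a JSON body to the Marquez REST API and return the HTTP status."""
    req = urllib.request.Request(
        MARQUEZ_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# With a running instance you would call, for example:
#   put("/namespaces/my-namespace",
#       namespace_payload("analytics-team", "Demo namespace"))
```

Jobs and datasets then hang off the namespace via the same PUT-style calls.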

Modular: since Marquez is modular in architecture, it can be used specifically to check the health and quality of the data, or to search for data by a specific tag.

Data discovery/search: a unified search platform where you maintain data documentation and tags, and through which you can search for the dataset you need.

Data health: this helps you answer how accurate your data is, check the schema, or find the source where the data is stored.

Triggers: reacting to changes in data state without polling. With triggers you can answer how incomplete your data is and reduce the manual handling of backfills.


Marquez has a concept of versioning: for every run and every dataset, it maintains a new version distinct from the previous one.

Having these versions helps you get the delta between the previous run or dataset and the current one.
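To make the "delta" idea concrete, here is a small self-contained sketch that diffs the field lists of two dataset versions. The helper name and the input shapes are my own for illustration, not part of the Marquez API:

```python
def schema_delta(prev_fields: list, curr_fields: list) -> dict:
    """Compare the fields of two dataset versions and report the delta."""
    prev, curr = set(prev_fields), set(curr_fields)
    return {
        "added": sorted(curr - prev),    # fields new in the current version
        "removed": sorted(prev - curr),  # fields dropped since the previous one
    }

# Example: a new "country" field appeared and "zip" was dropped.
delta = schema_delta(["id", "name", "zip"], ["id", "name", "country"])
# delta == {"added": ["country"], "removed": ["zip"]}
```

The same comparison is what lets you spot the silent schema changes described earlier before they break downstream pipelines.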

Benefits of Marquez architecture

This unique Marquez architecture helps you with:

  1. Debugging issues per version to find the root cause
  2. Identifying which source/destination versions are affected and impacted
  3. Failure detection and recovery

Data model

Sources can have one or more datasets.

And each dataset will have one or more versions as jobs progress.

Jobs will have one or more versions as we run them. And each run will produce a different dataset version based on its input.
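These relationships can be sketched as a toy object model. The class and field names below are illustrative only, not Marquez's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetVersion:
    fields: list                                   # the schema of this version

@dataclass
class Dataset:
    name: str
    versions: list = field(default_factory=list)   # grows as jobs run

@dataclass
class Source:
    name: str
    datasets: list = field(default_factory=list)   # a source has 1+ datasets

@dataclass
class Run:
    run_id: str
    outputs: list = field(default_factory=list)    # dataset versions produced

@dataclass
class Job:
    name: str
    runs: list = field(default_factory=list)       # a job has 1+ runs
```

Walking from a `Source` down to a `DatasetVersion`, or from a `Job` through its `Run`s to the versions they produced, mirrors the lineage queries Marquez answers.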

Metadata Collection

Currently the metadata is collected using

  1. The Marquez REST API
  2. Client libraries for Python and Java


Register a Job

This consists of job versions, input/output configuration details, and the owner/description.
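As an illustration, the registration body can be built like this. The field names follow my reading of the Marquez job API and should be checked against the docs; all values are placeholders:

```python
def job_payload(job_type: str, inputs: list, outputs: list,
                location: str, description: str) -> dict:
    """Build the JSON body for registering a job with Marquez."""
    return {
        "type": job_type,            # e.g. "BATCH" or "STREAM"
        "inputs": inputs,            # input datasets, e.g. [{"namespace": "...", "name": "..."}]
        "outputs": outputs,          # output datasets, same shape
        "location": location,        # e.g. a link to the job's source code
        "description": description,  # owner/description metadata
    }

# job = job_payload("BATCH",
#                   [{"namespace": "my-namespace", "name": "raw_orders"}],
#                   [{"namespace": "my-namespace", "name": "clean_orders"}],
#                   "https://example.com/repo/etl.py",
#                   "Cleans the raw orders feed")
```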

Register Job Run

The job run will be registered in Marquez.


Update the status of the job to STARTED.


Once the run completes, mark it as COMPLETED.
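The run lifecycle above maps to a short sequence of REST calls. The sketch below lists those calls as paths; the endpoint shapes follow my reading of the Marquez v1 API and are assumptions to verify against the docs:

```python
def run_lifecycle_paths(namespace: str, job: str, run_id: str) -> list:
    """The sequence of POST calls that take one run from creation to completion."""
    return [
        f"/namespaces/{namespace}/jobs/{job}/runs",  # 1. create/register the run
        f"/jobs/runs/{run_id}/start",                # 2. mark the run STARTED
        f"/jobs/runs/{run_id}/complete",             # 3. mark the run COMPLETED
    ]
```

Each transition is what lets Marquez attribute the dataset versions it observes to a specific run of a specific job version.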

Register Job Run Outputs

Details of the output locations.

Demo Time

To test out Marquez locally and quickly:

Clone the Marquez repository with git.

Make sure Docker is running.

Run the start command.


Once the required libraries are downloaded and the image has started, you can open http://localhost:3000/

Since we do not have any data source connected or any jobs running yet, you will not see any entries.



Ajith Shetty

Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.
