SuperGlue: A New Way of Interpreting Lineage

source: screenshot after local installation of superglue

Working with data is fascinating, and at the same time it can be very difficult.

One problem everyone can relate to is defining dependencies.

You could have hundreds of dependencies before you generate a single table for your presentation layer.

And each of those dependencies could have its own set of dependencies.

With so many interdependent tables and sources, it becomes difficult to manage and monitor the jobs and to deliver business value.

At this point we definitely need a lineage tracking tool that can help you find all these interdependencies and backtrack through them.
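To make "dependencies of dependencies" concrete, here is a minimal sketch of a lineage map and a function that collects everything a table depends on. The table names and the `upstream` helper are made up for illustration; this is not how SuperGlue stores or queries lineage.

```python
from collections import deque

# Hypothetical lineage: each table maps to the tables it reads from directly.
LINEAGE = {
    "sales_report": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["customers_raw", "regions_raw"],
}


def upstream(table, lineage):
    """Return every direct and indirect dependency of `table`."""
    seen = set()
    queue = deque(lineage.get(table, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited, skip to avoid repeated work
        seen.add(current)
        queue.extend(lineage.get(current, []))
    return seen


print(sorted(upstream("sales_report", LINEAGE)))
```

Even with three entries in the map, the presentation table ends up depending on five other tables, which is why doing this by hand does not scale.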

Enter SuperGlue (by Intuit).

What is SuperGlue

SuperGlue is a lineage tracking tool that helps you visualise your complex data pipelines and manage and monitor them.

Why do we need a lineage tracking tool

No matter how resilient a pipeline you build, with all types of dependencies handled, we cannot say that it will never fail.

There could be hundreds of reasons why your job might fail, and each failure contributes to showing wrong data to the business.

Reasons could be:

Code/SQL changes: You make a small change in a query and all your downstream applications are fed incorrect data.

Platform issues: A small network issue could cause your platform to go down.

Job issues: You make a change in the job scheduling that breaks a dependency.

Data source issues: A change in the source causes a problem in the destination.
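The first and last reasons above are about impact flowing downstream. As a rough illustration, the sketch below inverts a "reads from" map to list everything affected by a change to one table. All names here (`READS_FROM`, `downstream`, the tables) are hypothetical and not part of SuperGlue.

```python
# Hypothetical lineage: each table maps to the tables it reads from directly.
READS_FROM = {
    "orders_clean": ["orders_raw"],
    "daily_revenue": ["orders_clean"],
    "exec_dashboard": ["daily_revenue", "customers_clean"],
    "customers_clean": ["customers_raw"],
}


def downstream(changed, reads_from):
    """Return every table that directly or indirectly reads from `changed`."""
    # Invert the edges: source -> tables that consume it.
    consumers = {}
    for table, sources in reads_from.items():
        for source in sources:
            consumers.setdefault(source, []).append(table)

    impacted, stack = set(), [changed]
    while stack:
        for consumer in consumers.get(stack.pop(), []):
            if consumer not in impacted:
                impacted.add(consumer)
                stack.append(consumer)
    return impacted


print(sorted(downstream("orders_raw", READS_FROM)))
```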

The need for SuperGlue

The questions we frequently ask ourselves when a pipeline fails are:

  1. Which job failed

This is a never-ending cycle.

Whenever your pipeline fails, you will be asked these questions and will have to go back and figure out the answers again and again.

In some cases it could take hours, and sometimes maybe days.

With a data lineage tool, it becomes easier to backtrack your job from the place where it failed and look for any anomalies.
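Backtracking from a failure can be sketched as a walk up the dependency graph until you reach a failed job whose own inputs all succeeded. The job names, the statuses, and the `root_causes` helper below are assumptions for illustration only; SuperGlue itself focuses on visualising lineage, and this sketch assumes the graph has no cycles.

```python
# Hypothetical jobs: each job maps to the jobs it depends on.
DEPENDS_ON = {
    "publish_report": ["build_metrics"],
    "build_metrics": ["load_orders", "load_customers"],
    "load_orders": [],
    "load_customers": [],
}

# Hypothetical last-run status per job.
STATUS = {
    "publish_report": "failed",
    "build_metrics": "failed",
    "load_orders": "failed",
    "load_customers": "ok",
}


def root_causes(job, depends_on, status):
    """Walk upstream from a failed `job`; return failed jobs with no failed upstream."""
    failed_parents = [p for p in depends_on.get(job, []) if status.get(p) == "failed"]
    if not failed_parents:
        # Nothing upstream failed, so this job is where the trouble started.
        return {job}
    causes = set()
    for parent in failed_parents:
        causes |= root_causes(parent, depends_on, status)
    return causes


print(root_causes("publish_report", DEPENDS_ON, STATUS))
```

Here the report failed only because the job that loads orders failed two steps earlier, so that is the one place worth inspecting first.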

Features of SuperGlue

  1. SuperGlue will help you find the job dependencies.

DEMO

git clone https://github.com/intuit/superglue.git
cd superglue
docker-compose -f deployments/development/docker-compose.yml up

Note: you may need to increase the memory for the Elasticsearch container if it fails to launch.

http://localhost:8080

To install the superglue command-line client, run

./gradlew installDist
export PATH="${HOME}/.superglue/bin:${PATH}"

cd examples
superglue init --database

# In examples/
superglue parse

superglue elastic --load
http://localhost:8080

Since this is a local development environment, you will not be able to see most of the features.


Ajith Shetty

Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.
