In the digital era, almost every decision we take is backed by data.
It can be as small as buying a pair of shoes based on its rating,
or as big as taking over a company based on its customer satisfaction and the revenue it has generated and is forecast to generate.
More and more companies are investing in data and want to be data driven.
Why we need data
- To predict the weather
- To analyse customer behaviour based on the products they purchased
- To decide whether a product is going to be a success in a given country
and a million more reasons.
We, as a company, generate terabytes of data daily from different sources. It could be IoT devices, sales, clickstream, etc. This humongous amount of data needs to be processed at rapid speed.
This is why “Big Data” emerged, and it has brought numerous technologies to store, process and visualise the data in real time.
The storage problem is solved by cloud services like AWS S3 and Azure Data Lake.
The big data processing problem is solved by tools like Spark and dbt.
The data warehouse problem is solved by tools like Snowflake and Redshift.
The orchestration problem is solved by tools like Airflow and Dagster.
Now that we have solved a part of the problem, the new area to look at is how we manage this data and maintain the metadata for terabytes of it.
Let’s take a step back and see how data is acquired, processed and analysed for decision making.
Data-driven decision making is an ongoing process in which all the personas, like engineers, scientists and analysts, struggle to understand the data whenever and wherever they need it.
The common questions these personas have are:
- Who is the data owner?
- How frequently is the data refreshed?
- What are the components of the data?
- How authentic is the data?
- Where is the documentation?
- What was the last change?
- Are these columns still relevant?
These questions are never ending.
Most of the time, data engineers, scientists and analysts spend more time on data discovery than on coming up with the model that drives the decision.
Now how do we solve this problem?
We can do any of the following:
- Call up colleagues at random and bug them frequently to understand the sources and the data.
- Do a random search on Confluence.
- Create a script that loops over the millions of tables in each of your data sources, which would take ages.
Or you can leverage the metadata using Amundsen.
Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists and engineers when interacting with data. It does that today by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank style search based on usage patterns (e.g. highly queried tables show up earlier than less queried tables). Think of it as Google search for data. The project is named after Norwegian explorer Roald Amundsen, the first person to reach the South Pole.
Too many technical terms? Let’s put it in layman’s terms.
It’s a one-stop place where any newbie can go and look for all the data the whole team holds.
So it’s a platform to store the data? No, not really.
Think of it as your Google page. It searches and gives you results based on your query, and at the same time it gives you popular recommendations.
In data terms, it’s a metadata store which contains the following at a high level.
- All the sources the data is pulled from and pushed to
- Details of each table in any storage platform like Redshift, Hive, Presto, etc.
- Column-level information on what each column means
- How frequently the data is refreshed
- Who owns the data
- Who is using this data and how authentic it is
- Whether any dashboards are connected to this dataset
- Which tables are created or maintained by a given user
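To make the list above a bit more concrete, here is a minimal Python sketch of what one record in such a metadata store could look like. This is only an illustration and not Amundsen’s actual data model: the class names, field names and sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class ColumnMetadata:
    name: str
    description: str  # what the column means


@dataclass
class TableMetadata:
    # All field names are hypothetical, chosen to mirror the list above.
    source: str                 # storage platform, e.g. "redshift", "hive"
    schema: str
    name: str
    owners: List[str]           # who owns the data
    last_refreshed: datetime    # when the data was last refreshed
    columns: List[ColumnMetadata] = field(default_factory=list)
    frequent_users: List[str] = field(default_factory=list)  # who uses it
    dashboards: List[str] = field(default_factory=list)      # connected dashboards

    def key(self) -> str:
        """A unique, human-readable identifier for the table."""
        return f"{self.source}://{self.schema}/{self.name}"


def tables_owned_by(catalog: List[TableMetadata], user: str) -> List[TableMetadata]:
    """Answer 'which tables are maintained by a given user?'."""
    return [table for table in catalog if user in table.owners]


# Made-up sample data, for illustration only.
sales = TableMetadata(
    source="redshift",
    schema="public",
    name="sales_history",
    owners=["alice"],
    last_refreshed=datetime(2021, 1, 1),
    columns=[ColumnMetadata("order_id", "Unique id of the order")],
    dashboards=["Weekly sales"],
)
clicks = TableMetadata(
    source="hive",
    schema="raw",
    name="click_stream",
    owners=["bob"],
    last_refreshed=datetime(2021, 1, 2),
)
catalog = [sales, clicks]

print(sales.key())
print([table.name for table in tables_owned_by(catalog, "alice")])
```

Once the metadata is stored in one place in a shape like this, questions such as “who owns this table?” become a simple lookup instead of a phone call.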
New to Amundsen and the organisation, and don’t know where to begin?
It will show you a few popular sources and tables, based on previous searches by other users, which you might be interested in.
Can you relate to this metadata store now?
Let’s reiterate the problem and the solution Amundsen provides.
Amundsen gives you a whole view of the data maintained by the company. Let’s say the company acquires data from 100 different sources and stores it in a data warehouse.
Using Amundsen, we can get information about this data, such as:
- Which source we are getting the data from
- What type of data we brought in
- What each column means
- Who the owner of the data is
- How frequently it’s refreshed
and much more.
Ok so I hope I still have your attention 😎
Now let’s see how the search works.
We are all aware of Google’s PageRank: based on relevance and popularity, you get the most useful results on your first page.
Similarly, Amundsen looks for the term you searched for, compares it with all the queries that have been fired previously and the relevance of those tables, and displays the result.
Example: you search for sales_history.
Result based on relevance.
Relevance is based on the query you fired. A table with exactly that name is present in the default database, created by an intern for testing.
So based on relevance you might get “sales_history”.
Result based on popularity.
When it comes to popularity, your colleagues might be querying a table in the production database with the name “sales_forcast_history”, which is referred to and used in N different places. So it carries a higher weight.
So based on popularity you will get “sales_forcast_history”.
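The sales_history example can be sketched in a few lines of Python. To be clear, this is not Amundsen’s actual ranking code: the `TableEntry` class, the query counts, the `customer_address` table and the 0.5 similarity threshold are assumptions made up for illustration, and the text similarity comes from Python’s standard `difflib` module.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass
class TableEntry:
    database: str
    name: str
    query_count: int  # hypothetical usage signal: how often the table is queried


def relevance(term: str, table_name: str) -> float:
    """Text similarity between the search term and a table name (0.0 to 1.0)."""
    return SequenceMatcher(None, term.lower(), table_name.lower()).ratio()


def by_relevance(term: str, tables: List[TableEntry]) -> List[TableEntry]:
    """Best textual match first."""
    return sorted(tables, key=lambda t: relevance(term, t.name), reverse=True)


def by_popularity(term: str, tables: List[TableEntry],
                  min_relevance: float = 0.5) -> List[TableEntry]:
    """Among tables that roughly match the term, most queried first."""
    candidates = [t for t in tables if relevance(term, t.name) >= min_relevance]
    return sorted(candidates, key=lambda t: t.query_count, reverse=True)


# Made-up tables mirroring the example above.
tables = [
    TableEntry("default", "sales_history", query_count=3),             # the intern's test table
    TableEntry("production", "sales_forcast_history", query_count=1200),
    TableEntry("production", "customer_address", query_count=800),     # popular but unrelated
]

print(by_relevance("sales_history", tables)[0].name)
print(by_popularity("sales_history", tables)[0].name)
```

The unrelated `customer_address` table never shows up in the popularity result, however often it is queried, because it does not match the search term in the first place.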
Now that we have the basic understanding, let’s look at the components Amundsen is built from.
The frontend service is where you query and get the output.
It’s written in ReactJS on top of the Flask web framework.
The metadata service is backed by Neo4j and serves the queries coming from the frontend.
The reason to use Neo4j is that, as a graph database, it can store the relationships between tables and other components, such as who created a table, who queried it, and what other tables are queried by the same user.
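To see why a graph fits this kind of question, here is a tiny in-memory sketch in Python. It does not use Neo4j and is not Amundsen’s schema: the relationship names (`CREATED`, `QUERIED`), the users and the tables are all hypothetical.

```python
from typing import List, Set, Tuple

# Each edge is (start node, relationship, end node).
# All users, tables and relationship names are made up for illustration.
edges: List[Tuple[str, str, str]] = [
    ("alice", "CREATED", "sales_history"),
    ("bob", "QUERIED", "sales_forcast_history"),
    ("bob", "QUERIED", "customer_orders"),
    ("carol", "QUERIED", "sales_forcast_history"),
    ("carol", "QUERIED", "store_locations"),
]


def users_who(relationship: str, table: str) -> Set[str]:
    """Who created or queried a given table?"""
    return {user for user, rel, target in edges
            if rel == relationship and target == table}


def also_queried(table: str) -> Set[str]:
    """What other tables are queried by the users who query this table?"""
    users = users_who("QUERIED", table)
    return {target for user, rel, target in edges
            if rel == "QUERIED" and user in users and target != table}


print(users_who("CREATED", "sales_history"))
print(sorted(also_queried("sales_forcast_history")))
```

A graph database answers this “follow the relationship, then follow it back” style of question natively, which is exactly the kind of lookup described above.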
The search service is used to serve search requests from the user.
It’s based on the relevance and popularity which we just discussed above.
Databuilder is the framework that fetches the metadata from sources like Hive, Presto, Cassandra, etc. and persists it in the Neo4j graph.
It supports pull mode and push mode.
Pull mode: at a given time interval, the delta changes are pulled from each of the sources and Neo4j is refreshed.
Push mode: upon any change in the sources, the delta change is pushed to Neo4j.
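The difference between the two modes can be sketched in plain Python. This is a toy model and not the Databuilder API: `MetadataStore`, `Source`, `pull` and the version strings are all hypothetical, and a plain dictionary stands in for Neo4j.

```python
from typing import Dict, List

Delta = Dict[str, str]  # table name -> version marker (hypothetical)


class MetadataStore:
    """Stands in for the Neo4j graph in this sketch."""

    def __init__(self) -> None:
        self.tables: Dict[str, str] = {}

    def apply_delta(self, delta: Delta) -> None:
        self.tables.update(delta)


class Source:
    """Stands in for a data source such as Hive or Presto."""

    def __init__(self) -> None:
        self.tables: Dict[str, str] = {}
        self.subscribers: List[MetadataStore] = []

    def change(self, table: str, version: str) -> None:
        self.tables[table] = version
        # Push mode: every change is sent to the subscribers right away.
        for store in self.subscribers:
            store.apply_delta({table: version})


def pull(source: Source, store: MetadataStore) -> Delta:
    """Pull mode: meant to run on a schedule; copies only what changed.

    Deleted tables are not handled in this sketch.
    """
    delta = {name: version for name, version in source.tables.items()
             if store.tables.get(name) != version}
    store.apply_delta(delta)
    return delta


# Pull mode: the store only catches up when pull() runs.
hive, pulled_store = Source(), MetadataStore()
hive.change("sales_history", "v1")
stale_before_pull = dict(pulled_store.tables)
first_delta = pull(hive, pulled_store)
second_delta = pull(hive, pulled_store)  # nothing changed in between

# Push mode: the store is updated at the moment of the change.
presto, pushed_store = Source(), MetadataStore()
presto.subscribers.append(pushed_store)
presto.change("click_stream", "v1")

print(stale_before_pull, first_delta, second_delta)
print(pushed_store.tables)
```

In pull mode the metadata can be stale until the next scheduled run, while in push mode the source has to know about the metadata store and notify it on every change.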
Too much information already? 😊
Before you TL;DR, let’s put a full stop to the theory here. In Part 2, I will dive deep into the UI and do a demo of how to use it.
Until then, to keep you interested, please refer to the links below.
References and sources:
- Amundsen, the leading open source data catalog
- GitHub - amundsen-io/amundsen