Ever since the beginning of the Big Data bubble, all we have talked about is how we are going to Extract, Transform, and Load the data.
But the problem here is that data and compute are tightly coupled, and we pay for compute even when we do not need it.
Data warehousing is where we spend most of our dollars on storage and maintenance.
But even after spending a huge chunk of money, are we getting a proportionate output?
With the help of Fivetran or Stitch we could solve the ingestion problems, which takes…
When we talk about real-time metrics, underneath it is a batch, or more specifically micro-batches, by which our queries run and return results.
As the data size increases, the time it takes to process the data grows as well.
Now the need for real time has evolved so much that we need to see metrics within milliseconds and take action based on them.
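To see why micro-batches put a floor on that latency, here is a minimal sketch of a micro-batch streaming query in PySpark. The rate source, trigger interval, and console sink are illustrative assumptions, not anything a real pipeline would ship:

```python
# A minimal sketch of a micro-batch streaming query in PySpark.
# The "rate" source stands in for a real event stream (an assumption).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("micro-batch-demo").getOrCreate()

# The built-in rate source emits rows continuously.
events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

# Even this "real-time" aggregation executes as a series of micro-batches:
# at every trigger, Spark gathers the new rows and runs a batch job over them.
counts = events.groupBy(F.window("timestamp", "10 seconds")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="5 seconds")  # one micro-batch every 5 seconds
         .start())

query.awaitTermination(timeout=20)  # let a few micro-batches run, then stop
query.stop()
```

However small you make the trigger interval, the result is only as fresh as the last completed batch, which is exactly the gap purpose-built real-time stores aim to close.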
To help us solve this problem, LinkedIn and Uber partnered to create Apache Pinot.
The Apache Software Foundation Announces Apache® Pinot™ as a Top-Level Project.
What on earth does “Parallelising the parallel jobs” mean??
Without going in depth, in layman's terms:
Spark creates the DAG, or lineage, based on the sequence in which we created the RDDs and applied transformations and actions.
It applies the Catalyst optimiser on the DataFrame or Dataset to tune your queries. But what it doesn't do is run your independent jobs in parallel with each other.
We always tend to think of Spark as a framework which splits your job into stages and tasks and runs them in parallel.
In a way it is 100% true. …
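To make the distinction concrete: one common pattern (a sketch, not the only way) is to submit independent Spark jobs from multiple Python threads, since the Spark scheduler accepts jobs from several threads at once. The table names and output path below are hypothetical:

```python
# A minimal sketch of running independent Spark jobs concurrently via threads.
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-jobs-demo").getOrCreate()

def export_table(table_name: str) -> str:
    # Each call triggers its own Spark job (an action),
    # independent of the others.
    (spark.table(table_name)
          .write.mode("overwrite")
          .parquet(f"/tmp/exports/{table_name}"))
    return table_name

tables = ["sales", "customers", "inventory"]  # hypothetical tables

# Submit all three jobs at once instead of looping over them sequentially.
with ThreadPoolExecutor(max_workers=3) as pool:
    for done in pool.map(export_table, tables):
        print(f"finished exporting {done}")
```

Run sequentially, the three exports queue up one after another; submitted from threads, the scheduler can interleave their tasks across the cluster's free executors.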
Spark is used as an ingestion tool in more and more companies. It is a perfect replacement for commercial applications like Talend or Informatica.
Spark can connect to many different source systems: standard databases like SQL Server and Oracle, or NoSQL databases like Cassandra and MongoDB.
Each source system has its own optimisations available when we query it.
It takes the user query as input, optimises it, and returns only the required data.
Spark predicate pushdown uses the source system's built-in optimisation techniques to filter out the…
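A minimal sketch of what pushdown looks like in practice, assuming a hypothetical SQL Server source; the connection details are placeholders:

```python
# A minimal sketch of predicate pushdown over JDBC
# (URL, table, and credentials are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://example-host:1433;databaseName=shop")
          .option("dbtable", "dbo.orders")
          .option("user", "reader")
          .option("password", "secret")
          .load())

# This filter is not applied in Spark: it is translated into a WHERE clause
# and executed by the database, so only matching rows cross the network.
recent = orders.filter("order_date >= '2021-01-01'")

# The physical plan surfaces the pushed predicate (look for "PushedFilters").
recent.explain()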
Parquet is an efficient hybrid row-columnar file format which supports compression and encoding, making it even more performant both in storage and when reading the data.
Parquet is a widely used file format in the Hadoop ecosystem, and it is widely adopted across the data science world, mainly due to its performance.
We know Parquet is a hybrid row-columnar file format, but it does more than that under the hood to store data efficiently.
In this blog we will talk in depth about the Parquet file format and why is…
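As a quick taste of those internals, here is a sketch using PyArrow to peek at row groups, column chunks, and per-chunk statistics; the file path and columns are made up for illustration:

```python
# A minimal sketch inspecting Parquet's internal layout with PyArrow.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user_id": [1, 2, 3], "country": ["IN", "US", "DE"]})
pq.write_table(table, "/tmp/users.parquet", compression="snappy")

meta = pq.ParquetFile("/tmp/users.parquet").metadata
print(meta.num_row_groups)               # data is split into row groups...
print(meta.row_group(0).num_rows)        # ...each a horizontal slice of rows
col = meta.row_group(0).column(0)        # ...stored column chunk by chunk
print(col.compression, col.statistics)   # per-chunk codec and min/max stats
```

Those per-chunk min/max statistics are what let readers skip entire row groups that cannot match a filter.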
Expectation is a very strong word. In the world of data, expectations grow exponentially with your data size.
But how easy or difficult is it to meet these expectations?
Data engineers spend most of their time creating pipelines, managing them, and delivering the data to the right set of people.
But can we guarantee that the data is consistent and there are no outliers?
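As a flavour of the checks in question, here is a minimal plain-pandas sketch of a null check and an IQR-based outlier check; the column names and data are hypothetical:

```python
# A minimal sketch of consistency and outlier checks in plain pandas.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, None],
    "amount":   [25.0, 30.0, 27.5, 5000.0, 28.0],
})

# Consistency: key columns must never be null.
null_keys = orders["order_id"].isna().sum()
print(f"null order_ids: {null_keys}")

# Outliers: flag amounts outside 1.5x the interquartile range.
q1, q3 = orders["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = orders[(orders["amount"] < q1 - 1.5 * iqr) |
                  (orders["amount"] > q3 + 1.5 * iqr)]
print(outliers)
```

Hand-rolled checks like these quickly multiply, which is exactly the gap that dedicated data-quality frameworks set out to fill.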
Tools and technologies have evolved so much in the current generation that we talk about real time more than batch.
Consider a new product being introduced to the market, where you need to see the public acceptance rate. We cannot really wait for batch queries to extract the data, transform it, and present it in the UI. We need a real-time experience.
This is one of a million other real-time use cases, like clickstream analysis, analytics on top of user events, or fraud detection.
To your rescue, Apache Druid has arrived.
Druid was created by…
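To give a flavour of what real-time querying looks like, here is a sketch against Druid's SQL-over-HTTP endpoint; the host, port, and the clickstream datasource are assumptions for illustration:

```python
# A minimal sketch of querying Druid over its SQL HTTP API.
# Host, port, and datasource name are assumptions.
import requests

DRUID_SQL = "http://localhost:8888/druid/v2/sql"  # router in a default setup

query = """
SELECT TIME_FLOOR(__time, 'PT1M') AS minute, COUNT(*) AS events
FROM clickstream
GROUP BY 1
ORDER BY 1 DESC
LIMIT 5
"""

resp = requests.post(DRUID_SQL, json={"query": query})
resp.raise_for_status()
for row in resp.json():
    print(row)
```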
We are in a golden age of data, and the amount of data produced every second is a hundred times larger than what we generated maybe a couple of years ago.
I have already talked about the need for data discovery in my previous blog https://ajithshetty28.medium.com/its-am-un-dsen-which-every-data-driven-company-needs-part-1-da07019c4c8f and how Amundsen solved the problem at Lyft.
But let's just reiterate a few points.
We have a new source coming up almost every week. It could be a cloud source system like Redshift or Athena, or a commercial platform like Snowflake.
Each of these sources is trying…
Since the beginning of the digital era, every decision we take has been backed by data.
It can be as small as buying shoes based on their ratings.
Or as big as acquiring a company based on customer satisfaction and the revenue it has generated and is forecast to generate.
Now more and more companies are investing in data and want to be data-driven, for these and a million more reasons.
Delta Lake is a term you have heard or read about in hundreds of blogs, or you may even have used it in your project.
The intention of this blog is not only to talk about Delta Lake and its concepts, but to familiarise you with how it works under the hood.
Before we go any deeper, let's set the baseline.
Delta Lake is an open format storage layer that delivers reliability, security and performance on your data lake — for both streaming and batch operations. …
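As a minimal sketch of that "streaming and batch" claim, here is PySpark writing versioned commits to a Delta table and reading one back with time travel. It assumes a Spark session already configured with the Delta Lake package, and the path is a placeholder:

```python
# A minimal sketch of Delta Lake's versioned table format with PySpark.
# Assumes the session is configured with the delta package; path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

path = "/tmp/delta/events"

# Writes are ACID: each one becomes a new versioned commit in the _delta_log.
spark.range(0, 5).write.format("delta").mode("overwrite").save(path)
spark.range(5, 10).write.format("delta").mode("append").save(path)

# Read the current state...
spark.read.format("delta").load(path).show()

# ...or time-travel to an earlier commit.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```

That transaction log in _delta_log is the piece we will keep coming back to when we look under the hood.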