Photo by ArtisanalPhoto on Unsplash

What on earth does “Parallelising the parallel jobs” mean??

Without going into depth, in layman's terms:

Spark creates the DAG, or lineage, based on the sequence in which we created the RDD and applied transformations and actions.

It applies the Catalyst optimiser to the DataFrame or Dataset to tune your queries. But what it doesn't do is run your functions in parallel with each other.

We always tend to think of Spark as a framework which splits your jobs into stages and tasks and runs them in parallel.
In a way, that is 100% true. …
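
As a rough illustration of the idea, independent jobs can be submitted from separate threads so that they overlap instead of queueing one after another. The sketch below uses only the Python standard library; `run_job` and the job names are hypothetical stand-ins for independent Spark actions, and `time.sleep` simply simulates work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_job(name: str, seconds: float) -> str:
    """Hypothetical stand-in for one independent job (for example, loading one table)."""
    time.sleep(seconds)  # simulate the time the job spends working
    return f"{name} done"


# Three jobs that do not depend on each other (names are made up for this sketch).
jobs = [("load_customers", 0.2), ("load_orders", 0.2), ("load_products", 0.2)]


def run_sequentially():
    # The driver waits for each job to finish before starting the next one.
    return [run_job(name, seconds) for name, seconds in jobs]


def run_in_parallel():
    # Each worker thread submits one job; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        return list(pool.map(lambda job: run_job(*job), jobs))


start = time.perf_counter()
sequential_results = run_sequentially()
sequential_time = time.perf_counter() - start

start = time.perf_counter()
parallel_results = run_in_parallel()
parallel_time = time.perf_counter() - start

print(sequential_results == parallel_results)
print(f"sequential: {sequential_time:.2f}s, parallel: {parallel_time:.2f}s")
```

Both runs produce the same results; only the waiting time differs. In a real Spark application each thread would trigger its own action, which this sketch does not show.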

Photo by Jason Dent on Unsplash

We have 100s of blogs and pages which talk about cache and persist in Spark.

In this blog, the intention is not only to talk about cache or persist, but to take this one step further and walk you through how they work on top of Tables, Views and DataFrames.

In the case of a DataFrame, we are aware that the cache or persist command doesn't cache the data in memory immediately, as it is lazy, just like a transformation.

Only when we call an action such as count is the data materialised.
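
The lazy behaviour can be pictured with a toy model in plain Python. This is only a sketch: `ToyFrame` and all of its attributes are invented for illustration and are not part of the Spark API.

```python
class ToyFrame:
    """Toy stand-in for a DataFrame, used only to illustrate lazy caching."""

    def __init__(self, rows):
        self._source = rows            # where the data comes from
        self._cache_requested = False  # set by cache(), nothing is read yet
        self._cached_rows = None       # filled only when an action runs
        self.source_reads = 0          # how many times the source was scanned

    def cache(self):
        # Lazy: only records the intent to cache.
        self._cache_requested = True
        return self

    @property
    def is_materialised(self):
        return self._cached_rows is not None

    def _rows(self):
        if self._cached_rows is not None:
            return self._cached_rows   # served from the cache
        self.source_reads += 1         # a real scan of the source
        rows = list(self._source)
        if self._cache_requested:
            self._cached_rows = rows   # the action materialises the cache
        return rows

    def count(self):
        # An action: forces the data to be read.
        return len(self._rows())


df = ToyFrame(range(1000)).cache()
print(df.is_materialised)  # False: cache() alone reads nothing
df.count()
print(df.is_materialised)  # True: the first action filled the cache
df.count()
print(df.source_reads)     # 1: the second count reused the cached rows
```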

But will spark.sqlContext.cacheTable(dbname.table) materialise it?

No. So again, the .cacheTable…

Photo by Jakub Skafiriak on Unsplash

Spark is used as an ingestion tool in more and more companies. In many cases it can replace commercial applications like Talend or Informatica.

Spark can connect to many different source systems, from standard databases like SQL Server or Oracle to NoSQL databases like Cassandra, MongoDB etc.

Each of the source systems has its own level of optimisation available when we query it.

The source system takes the user's query as input, tries to optimise it and returns only the required data.
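
To see why letting the source do the filtering matters, here is a small sketch that uses SQLite from the Python standard library as a stand-in source system. The `orders` table, its columns and both helper functions are assumptions made up for this example; it does not use Spark itself.

```python
import sqlite3

# An in-memory database plays the role of the source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, "IN" if i % 10 == 0 else "US", float(i)) for i in range(1, 1001)],
)


def without_pushdown(connection):
    # Fetch every row, then filter on the client side.
    rows = connection.execute(
        "SELECT id, country, amount FROM orders ORDER BY id"
    ).fetchall()
    matching = [row for row in rows if row[1] == "IN"]
    return len(rows), matching


def with_pushdown(connection):
    # The filter travels with the query, so the source returns only matches.
    rows = connection.execute(
        "SELECT id, country, amount FROM orders WHERE country = ? ORDER BY id",
        ("IN",),
    ).fetchall()
    return len(rows), rows


transferred_all, result_all = without_pushdown(conn)
transferred_filtered, result_filtered = with_pushdown(conn)

print(transferred_all)                # 1000 rows left the source
print(transferred_filtered)           # 100 rows left the source
print(result_all == result_filtered)  # True: same answer either way
```

The answer is identical in both cases; the difference is how much data has to leave the source before the filter is applied.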

Spark predicate pushdown uses the source system's built-in optimisation techniques to filter out the…

Photo by Jakub Skafiriak on Unsplash

As data engineers, we are challenged every day with unusual cases to solve.

We cannot apply the same thought process everywhere. We need to think about the effort it takes and how we can reduce the time.

But at the same time, we are supposed to know the internals before we jump in and come to a conclusion.

In this blog, I shall be talking about a few tips and tricks and some lesser-known facts in Spark which will come in handy for most of our fellow data engineers.

1. Does count() always trigger an evaluation of each row?

We are aware that the .cache or…

Photo by Mr Cup / Fabien Barral on Unsplash

Parquet is an efficient columnar file format which supports compression and encoding, which makes it even more performant both in storage and when reading the data.

Parquet is a widely used file format in the Hadoop ecosystem, and it is well received by most of the data science world, mainly due to its performance.

We are aware that Parquet is a columnar file format, but it does more than that under the hood to store the data efficiently.
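
Two of the encodings Parquet relies on are dictionary encoding and run-length encoding. The sketch below shows the idea on a single column in plain Python. It is a simplified illustration only: the function names are made up, and the real on-disk layout (pages, bit-packing, compression codecs) is considerably more involved.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code plus a lookup dictionary."""
    dictionary = []  # code -> original value
    index_of = {}    # original value -> code
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return [tuple(run) for run in runs]


def decode(dictionary, runs):
    """Rebuild the original column from the dictionary and the runs."""
    return [dictionary[code] for code, length in runs for _ in range(length)]


# Values of one column stored together, which is what a columnar layout gives us.
country = ["IN", "IN", "IN", "US", "US", "IN", "UK", "UK", "UK", "UK"]

dictionary, codes = dictionary_encode(country)
runs = run_length_encode(codes)

print(dictionary)                         # ['IN', 'US', 'UK']
print(runs)                               # [(0, 3), (1, 2), (0, 1), (2, 4)]
print(decode(dictionary, runs) == country)  # True
```

Because values of the same column sit next to each other, repeats are common and both encodings pay off; the same tricks would help far less on row-by-row storage.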

In this blog we will be talking in depth about the Parquet file format and why is…


We have 100s of blogs talking about Dagster and what it does, and the Airflow vs Dagster discussion has been going on for a long time.

But here we are taking a step back and talking about the origin of Dagster, and we shall answer the questions below.

  1. How did it start
  2. Why did it come into existence
  3. Why do we need it

A Quick Intro to Dagster

Dagster is developed by Elementl, a company founded by Nick Schrock that aims to reshape the data management ecosystem. Nick Schrock, the CEO of Elementl, is also the creator of Dagster, a new programming model for data processing.

Just before we spill…

Photo by Jessica Johnston on Unsplash

Partitioning and bucketing are the most common optimisation techniques we have been using in Hive and Spark.

In this blog I shall be talking specifically about bucketing and how it should be used.

Bucketing works only when certain conditions are met. We shall be talking about those and how to get the most out of bucketing.

What is bucketing?

In Spark and Hive, bucketing is an optimisation technique. We provide the column by which the data needs to be divided into buckets.
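
The core idea can be sketched in plain Python: hash the bucketing column and take the result modulo the number of buckets. This is an illustration under assumptions, not Spark's implementation: `zlib.crc32` is used here only as a convenient deterministic hash, and the tables, column positions and function names are made up.

```python
import zlib
from collections import defaultdict


def bucket_for(key: str, num_buckets: int) -> int:
    # Deterministic hash of the bucketing column, reduced to a bucket number.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketise(rows, key_index, num_buckets):
    """Group rows by the bucket number of their key column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[key_index], num_buckets)].append(row)
    return buckets


def join_bucket_by_bucket(left, right, num_buckets):
    """Join two bucketed tables by comparing only same-numbered buckets."""
    joined = []
    for bucket in range(num_buckets):
        names = dict(right.get(bucket, []))  # customer_id -> name
        for customer_id, amount in left.get(bucket, []):
            if customer_id in names:
                joined.append((customer_id, names[customer_id], amount))
    return sorted(joined)


NUM_BUCKETS = 4
orders = [("c1", 10), ("c2", 20), ("c1", 30), ("c3", 40), ("c2", 50)]
customers = [("c1", "Asha"), ("c2", "Ben"), ("c3", "Chen")]

# Both tables use the same column and the same number of buckets.
order_buckets = bucketise(orders, 0, NUM_BUCKETS)
customer_buckets = bucketise(customers, 0, NUM_BUCKETS)

joined = join_bucket_by_bucket(order_buckets, customer_buckets, NUM_BUCKETS)
print(joined)
```

Because a given key always lands in the same bucket number in both tables, matching rows can be found bucket by bucket, which is why the column and the number of buckets need to line up on both sides.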

We need to make sure that the bucketing conditions are met to get the most out of it…

Photo by AbsolutVision on Unsplash

This is the second part of Lesser Known Facts/Short cuts in Spark, where I will be talking about a few lesser-known and interesting facts about Spark.

Apache Spark itself is vast, and it often offers many methods to achieve the same result.

But one thing to note is that they don't all work the same way under the hood.

The intention of this blog is to make you familiar with the internals of Spark that we might not be aware of and how they work under the hood, plus to give some tips to use them efficiently.


Ajith Shetty

BigData Engineer with a love for BigData, Analytics, Cloud and Infrastructure. Want to talk more? Ping me on LinkedIn:
