SQL has been the foundation of analytics for ages.

Whether you are a business analyst, a data scientist, or a data engineer, we all share a common language: SQL.

But even though we speak a common language, there is always a barrier: how are you going to get the data into a place where you can query it?

How are you going to make everyone's life easier?

Our data sits in silos in a hundred different places. It could be Parquet files on S3, Redshift, Snowflake, a NoSQL store like Cassandra, or, to make it worse, event data streaming in from Kafka.

To support all of these, you would either need to set up a separate architecture to bring the data into a single common place, or ask your analysts to go to each system and learn a new query technology.

This is exactly where PRESTO comes to your rescue.

The selling point of PRESTO is “SQL query on everything anywhere”.

PRESTO is built to support many different source systems holding different types of data, and you can talk to all of them by running plain SQL commands.
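As a quick sketch of what that looks like in practice: Presto addresses every table as catalog.schema.table, so one query can join data across systems. The catalog, schema, and table names below are hypothetical, assuming a Hive catalog over S3 and a MySQL catalog have been configured.

```sql
-- Join Parquet data on S3 (via a "hive" catalog) with a MySQL table
-- (via a "mysql" catalog) in a single SQL statement.
SELECT o.order_id, o.total, c.country
FROM hive.web.orders o            -- Parquet files on S3
JOIN mysql.crm.customers c        -- operational MySQL table
  ON o.customer_id = c.id
WHERE o.order_date >= DATE '2021-01-01';
```

No data is copied anywhere first; Presto reads from both sources at query time.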

PRESTO History

Presto was created at Facebook, and later in 2013 Facebook open-sourced it.


What is PRESTO

Before we deep dive, let's make one thing clear: PRESTO is not a database, it is a distributed SQL query engine.

Presto is used to query petabytes of data in your data warehouse.

Features of PRESTO

PRESTO has a cost-based optimiser: it translates your query into candidate execution plans, estimates the cost of returning the data under each, and picks the cheapest plan to execute.

Scales horizontally. Presto can be scaled out to support your workload and scaled back in when you no longer need the extra power.

Separates compute and storage. Your data can live anywhere, and you spin up compute only when you need it.

Supports high concurrency, in both queries and users.

Agnostic to cloud/storage/database. PRESTO supports a wide variety of storage systems, SQL and NoSQL alike, and can run on top of any cloud storage. Even when a backend source system changes, you can keep using PRESTO the same way.

No data copies. Since Presto is not a database, you do not need to copy data anywhere before you query it; Presto runs directly on top of your current storage systems.

Supports a wide variety of SQL features. Presto handles complex data types, and at the same time supports IF/CASE expressions, formatting functions, JSON functions, TRY, lambda expressions, and many more.
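A few of those features in action (these are standard Presto built-ins; the queries are self-contained, so you can paste them straight into any Presto session):

```sql
-- TRY: turn a runtime failure (here, division by zero) into NULL
-- instead of failing the whole query
SELECT TRY(1 / 0);

-- JSON functions: pull a scalar field out of a JSON string
SELECT json_extract_scalar('{"user": {"id": 42}}', '$.user.id');

-- Lambda expressions: apply a function to every element of an array
SELECT transform(ARRAY[1, 2, 3], x -> x * 2);

-- IF and CASE expressions
SELECT IF(true, 'yes', 'no'),
       CASE WHEN 2 > 1 THEN 'bigger' ELSE 'smaller' END;
```

TRY in particular is handy over messy lake data, where a handful of malformed rows would otherwise kill a long-running query.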



The scheduler is responsible for orchestrating queries across the workers.


Metadata API: helps Presto understand the databases, tables, schemas, columns, and data types that a data source exposes to it.

Data Location API: tells Presto how to parallelise the query (for a clustered source it can split the work across parallel connections; for a single-node source it falls back to a single connection on a single thread).

Data Stream API: pulls the raw bytes from the data source and converts them into rows and columns.
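The metadata exposed through these connector APIs is what powers ordinary SQL discovery statements. Assuming a catalog named hive has been configured, you can browse a source without knowing anything about it up front:

```sql
-- All of these are answered from the connector's Metadata API
SHOW SCHEMAS FROM hive;            -- databases/schemas in the source
SHOW TABLES FROM hive.web;         -- tables in one schema
DESCRIBE hive.web.orders;          -- columns and data types of one table
```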


Execution engine framework

After the data is pulled out of storage, PRESTO holds it in a columnar, vectorised format for faster processing.

Supports ORC and Parquet: PRESTO reads the popular columnar storage file formats natively.
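With the Hive connector, the file format is simply a table property. A minimal sketch, assuming a Hive catalog named hive and a schema named web already exist:

```sql
-- The format property selects the columnar file format for the table
CREATE TABLE hive.web.events (
  event_id   BIGINT,
  event_type VARCHAR,
  ts         TIMESTAMP
)
WITH (format = 'PARQUET');   -- or 'ORC'
```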

Cost-based optimiser: supports a wide variety of queries by projecting their cost and reducing execution time. It uses statistics about the data to re-order join conditions and to choose between strategies such as repartitioning and broadcasting data.
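You can see the plan the optimiser chose with EXPLAIN (the tables below are hypothetical):

```sql
-- Shows the chosen execution plan, including join order and
-- whether a join is PARTITIONED or REPLICATED (broadcast)
EXPLAIN
SELECT c.country, count(*)
FROM hive.web.orders o
JOIN hive.web.customers c ON o.customer_id = c.id
GROUP BY c.country;
```

When table statistics are available, the optimiser may flip the join order or broadcast the smaller table; EXPLAIN ANALYZE additionally runs the query and reports actual per-stage costs.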


1. Unzip the file

2. cd presto-server-0.175

3. create an etc directory inside the installation directory

4. create etc/node.properties

5. create etc/jvm.config

6. create etc/config.properties

7. inside the bin directory run

./launcher start
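As a rough sketch of what those etc files contain for a minimal single-node setup (values follow the Presto deployment docs; adjust the memory limits, port, and paths for your machine):

```properties
# etc/node.properties — identifies this node
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data

# etc/config.properties — one node acting as both coordinator and worker
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080
```

On a real cluster you would set coordinator=false and node-scheduler.include-coordinator=false on the worker nodes, with discovery.uri pointing at the coordinator.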


Stay tuned for part 2, where I will walk you through the step-by-step process of connecting to the S3 file system by setting up a local Hive metastore and a local Presto installation.


Ajith Shetty

Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.


Interested in weekly updates on big data analytics around the world? Subscribe to my weekly newsletter: Just Enough Data
