Then SQL, Now Presto SQL
SQL has been the foundation of analytics for decades. Business analysts, data scientists, and data engineers all share it as a common language.
But even with a common language, there is always a barrier: how are you going to get the data into one place so people can query it? How are you going to make their lives easier?
Our data sits in silos in a hundred different places. It could be Parquet files on S3, Redshift, Snowflake, a NoSQL store like Cassandra, or, to make it worse, event data streaming through Kafka.
Supporting all of these normally means building a separate architecture to bring them into one common place, or asking your analysts to go learn a new query technology for each source.
Presto comes to your rescue at this very moment.
The selling point of Presto is "SQL query on everything, anywhere".
Presto is built to support many different source systems holding different types of data, and you talk to all of them by running SQL commands.
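To make that concrete, here is a minimal sketch of a federated query. The catalog, schema, and table names (`hive.sales.orders`, `cassandra.crm.customers`) are hypothetical; in Presto every source is addressed as `catalog.schema.table`:

```sql
-- Join Parquet data on S3 (exposed through a Hive catalog) with a
-- Cassandra table in a single SQL statement -- no copying, no ETL.
SELECT o.order_id, o.total, c.email
FROM hive.sales.orders o             -- Parquet files on S3
JOIN cassandra.crm.customers c       -- Cassandra table
  ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2020-01-01';
```

The same syntax works whichever connectors are behind the catalogs, which is exactly the "everything, anywhere" promise.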
Presto was developed at Facebook when they realised that Hive could not serve their query needs across all of their sources. Facebook open-sourced it in 2013.
Did you know that AWS Athena is built on top of Presto?
What is Presto?
Presto is a distributed query engine that runs on top of heterogeneous data sources through a connector API. To be more specific, Presto can query distributed data sources in place.
Before we dive deep, let's make one thing clear: Presto is not a database, it is a SQL query engine.
Presto is used to query petabytes of data in your data warehouse.
Features of Presto
Presto uses SQL to query any supported data source.
Presto has a cost-based optimiser. It estimates the cost of the alternative execution plans for your query and, based on that cost, chooses the cheapest plan to execute and return the results.
Scales horizontally. Presto can be scaled out to support your workload and scaled back in when you no longer need the extra power.
Separates compute and storage. Data can live anywhere, and you use compute only when you need it.
Supports highly concurrent queries as well as concurrent users.
Agnostic to cloud, storage, and database. Presto supports a wide variety of storage systems, SQL or NoSQL, and can run on top of any cloud storage. Even when the backend source system changes, you can keep using Presto.
No data copies. Since Presto is not a database, you do not need to make copies of the data before you query it; Presto runs directly on top of your current storage systems.
Supports a wide variety of SQL. Presto supports complex data types, and at the same time it supports IF/CASE expressions, formatting functions, JSON functions, TRY, lambda expressions, and many more.
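A quick illustration of a few of these in one query. The table and column names are made up, and `items` is assumed to be an array of rows with a `qty` field:

```sql
SELECT
  TRY(CAST(price AS DOUBLE))                AS price_num,   -- NULL instead of a query failure
  json_extract_scalar(payload, '$.user.id') AS user_id,     -- JSON function
  filter(items, x -> x.qty > 0)             AS valid_items, -- lambda expression
  format_datetime(created_at, 'yyyy-MM-dd') AS created_day  -- formatting function
FROM hive.web.events;
```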
The coordinator is the brain of the system. It is responsible for parsing, analysing, and planning user queries, and it interacts with the worker nodes to serve client requests. The coordinator talks to the workers over a REST API.
The scheduler is responsible for orchestrating query execution across the workers.
We can have multiple workers for query processing. The coordinator sends tasks to the worker nodes; the workers read data from the sources and run their part of the query. Workers apply projections, aggregations, and joins to compute the query over the data and return the results to the client.
Metadata API: exposes the schemas, tables, columns, and data types that a data source makes available to Presto.
Data location API: tells Presto how to parallelise the query (for example, a clustered source can be read in parallel, while a single-node source gets a single-threaded connection).
Data stream API: pulls bytes from the data source and converts them into rows and columns.
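Connectors that implement these APIs are registered with the server through catalog properties files, one file per catalog. As a sketch, a Hive catalog (the metastore host below is a placeholder) could look like:

```
# etc/catalog/hive.properties -- registers a catalog named "hive"
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example-metastore:9083
```

Queries then address this source as `hive.<schema>.<table>`.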
Presto supports connections to popular visualisation tools like Tableau or Superset through JDBC/ODBC connectors, and you can connect it to Jupyter as well.
Execution engine framework
MPP execution: Presto uses an MPP-style, in-memory execution engine to process petabytes of data while keeping execution times low.
After the data is pulled out of storage, Presto holds it in a columnar, vectorised format for faster processing.
Supports ORC and Parquet: Presto reads the popular columnar storage file formats.
Cost-based optimiser: supports a wide variety of queries by projecting the cost of candidate plans and reducing execution time. It uses statistics about the data to re-order join conditions and to apply features like repartitioning and broadcasting the data.
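You can inspect what the optimiser decided with EXPLAIN. For example (hypothetical tables), the distributed plan shows how the join is executed, including whether a side is broadcast (replicated) or repartitioned:

```sql
EXPLAIN (TYPE DISTRIBUTED)
SELECT c.region, sum(o.total)
FROM hive.sales.orders o
JOIN hive.sales.customers c ON o.customer_id = c.customer_id
GROUP BY c.region;
```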
1. Download the setup: https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.175/presto-server-0.175.tar.gz
2. Unzip the file
3. cd presto-server-0.175
4. Create etc/config.properties
5. Create etc/jvm.config
6. Create etc/node.properties
7. Run the launcher from the bin directory
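For reference, a minimal single-node configuration (coordinator and worker in one process) could look like the sketch below; the values are illustrative and should be sized for your machine:

```
# etc/node.properties
node.environment=production
node.id=presto-node-1
node.data-dir=/var/presto/data

# etc/jvm.config
-server
-Xmx8G
-XX:+UseG1GC

# etc/config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=4GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080
```

Then `bin/launcher run` starts the server in the foreground (use `bin/launcher start` to daemonise it).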
Stay tuned for part 2, where I will walk you through the step-by-step process of connecting to the S3 file system by creating a local Hive metastore and a local Presto setup.
Presto | Distributed SQL Query Engine for Big Data
GitHub - prestodb/presto: The official home of the Presto distributed SQL query engine for big data
Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.
Interested in getting weekly updates on big data analytics around the world? Do subscribe to my weekly newsletter, Just Enough Data.