The more you realtime, The more you DRUID

Source: https://druid.apache.org/

Data tools and technologies have evolved so much in the current generation that we talk about realtime more than batch.

Consider a new product introduction to the market where you need to see the public acceptance rate. We cannot really wait for batch jobs to extract and transform the data before presenting it in the UI. We need a realtime experience.

This is one of a million realtime use cases, like clickstream analysis, analytics on top of user events, or fraud detection.

This is where Apache Druid comes to the rescue.

Background

Druid was created at Metamarkets, a digital advertising analytics company that was later acquired by Snap Inc., the parent of Snapchat.

The original use case was a realtime solution for analyzing recently served ads and viewing their clickstream events.

The requirements were to:

  1. Scale to support millions of events per second.
  2. Support semi-structured data.
  3. Support high concurrency.
  4. Support analysts with pre-aggregation.
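
To make the pre-aggregation requirement concrete: Druid implements it through "rollup", where raw events that share the same dimension values within a time bucket are combined into one aggregated row at ingestion time. Below is a minimal sketch of the dataSchema portion of a Druid ingestion spec with rollup enabled; the datasource, dimension, and metric names are invented for illustration.

```python
import json

def rollup_data_schema(datasource: str) -> dict:
    """Sketch of the dataSchema part of a Druid ingestion spec with
    rollup (ingest-time pre-aggregation) enabled."""
    return {
        "dataSource": datasource,
        "timestampSpec": {"column": "ts", "format": "iso"},
        "dimensionsSpec": {"dimensions": ["country", "ad_id"]},
        # Raw rows sharing the same dimension values within one minute
        # collapse into a single pre-aggregated row with these metrics.
        "metricsSpec": [
            {"type": "count", "name": "events"},
            {"type": "longSum", "name": "clicks", "fieldName": "clicks"},
        ],
        "granularitySpec": {
            "queryGranularity": "minute",   # rollup bucket size
            "segmentGranularity": "hour",   # one segment per hour
            "rollup": True,
        },
    }

print(json.dumps(rollup_data_schema("ad_events"), indent=2))
```

With this schema, analysts query the pre-aggregated `events` and `clicks` metrics instead of scanning every raw event.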

Core design requirements:

  1. Search support during realtime ingestion
  2. First-class support for time-based data
  3. Support for both batch analytics and realtime queries

DRUID High Level Features

  1. Realtime streaming ingestion.
  2. High concurrency.
  3. Scalability by adding servers, supporting millions of events.
  4. Column-oriented storage.
  5. Query caching.
  6. High availability.
  7. Time-based partitioning: Druid automatically partitions data by time, and you can add secondary partitions of your own. This helps prune segments during your search queries.

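Time-based partitioning pays off at query time: a query that filters on `__time` only needs to touch the segments covering that interval. A sketch of a Druid SQL payload, which is POSTed as JSON to the Broker's `/druid/v2/sql` endpoint (the `ad_events` datasource, column names, and host are hypothetical):

```python
import json

# Druid SQL accepts a JSON payload POSTed to the Broker at /druid/v2/sql.
# The __time filter lets Druid prune segments outside the one-day interval.
payload = {
    "query": """
        SELECT country, SUM(clicks) AS total_clicks
        FROM ad_events
        WHERE __time >= TIMESTAMP '2023-08-01 00:00:00'
          AND __time <  TIMESTAMP '2023-08-02 00:00:00'
        GROUP BY country
        ORDER BY total_clicks DESC
    """,
    "resultFormat": "objectLines",
}

# Against a real cluster (hypothetical broker host/port):
#   curl -X POST http://broker:8082/druid/v2/sql \
#        -H 'Content-Type: application/json' -d "$(python this_script.py)"
print(json.dumps(payload))
```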
Where does the DRUID Fit In

At a very high level, Druid sits right in between your processing layer and your presentation layer.

It gives you the realtime Analytics layer to speed up the computation and analysis.

High Level Architecture

Let’s talk about the architecture at a high level first, and then I shall introduce each of the components.

Druid doesn't have an engine of its own per se; rather, it is a group of processes, each dedicated to a specific set of jobs.

Master servers make sure the data is balanced and replicated.

The Overlord takes the work, splits it into discrete units, and hands it over to the Middle Managers (which take care of indexing).

Together they are responsible for managing your data in and out of the cluster, based on requests from the query servers.

Under the query servers we have the Brokers, which are responsible for user query management.

The Broker is responsible for distributing the tasks that execute queries and for merging the results before responding to the users.

Deep storage: the storage space for all the data that is available for you to query. It could be S3, ADLS, or HDFS.

Metadata: the metadata store is a relational database holding a map of where the segments live and what they contain.

ZooKeeper: the cluster's processes coordinate through ZooKeeper (leader election, service discovery) for high availability.


WRITE Path

The MIDDLE MANAGER is responsible for all ingestion, both batch and streaming. It also serves queries from the QUERY SERVERS for data that is still being ingested.

The MIDDLE MANAGER analyses the data, builds the aggregations and indexes, partitions the data, encodes the dictionary files, and writes out segments to deep storage, where they can be read by the HISTORICALS and made available for queries.

The Middle Manager is also responsible for creating data segments and for data compaction processing.
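
Compaction matters because streaming ingestion tends to produce many small segments. A minimal sketch of a Druid compaction task that merges one day's segments (the datasource name and interval are made up for illustration):

```python
import json

# Sketch of a Druid compaction task: merge the many small segments that
# streaming ingestion produced for one day into fewer, larger segments.
compaction_task = {
    "type": "compact",
    "dataSource": "ad_events",
    "ioConfig": {
        "type": "compact",
        "inputSpec": {
            "type": "interval",
            "interval": "2023-08-01/2023-08-02",
        },
    },
}

# This JSON would be submitted to the Overlord at /druid/indexer/v1/task.
print(json.dumps(compaction_task))
```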

Now the MASTER PROCESSES get an update about the newly added data and instruct the HISTORICALS to load it.

HISTORICALS are a group of data servers responsible for loading segments from DEEP STORAGE onto their own local storage and for answering user queries over this older data.

After a period of time, or once a size threshold is reached, the realtime data is handed off to deep storage, where segments are immutable.

All of this is tracked by the master servers.

We can increase the number of Middle Managers, which lets ingestion scale linearly and process in parallel.
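
Putting the write path together: for streaming ingestion you submit a supervisor spec, which the Overlord uses to schedule indexing tasks on the Middle Managers; those tasks consume the stream and periodically publish segments to deep storage. A minimal sketch of a Kafka supervisor spec (topic name, broker address, and column names are invented):

```python
import json

# Minimal sketch of a Druid Kafka supervisor spec. The Overlord schedules
# taskCount indexing tasks on the Middle Managers; each consumes the topic
# and hands finished segments off to deep storage every taskDuration.
supervisor = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "ad_events",
            "timestampSpec": {"column": "ts", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["country", "ad_id"]},
            "granularitySpec": {
                "segmentGranularity": "hour",
                "queryGranularity": "minute",
            },
        },
        "ioConfig": {
            "topic": "ad-events",
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
            "taskCount": 2,          # parallel ingestion tasks (scale knob)
            "taskDuration": "PT1H",  # hand segments off every hour
        },
    },
}

# This JSON would be POSTed to the Overlord at /druid/indexer/v1/supervisor.
print(json.dumps(supervisor))
```

Raising `taskCount` (and adding Middle Managers to run the tasks) is exactly the linear scale-up described above.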


READ Path

Let’s look in depth at how queries flow in and how they get processed before being returned to the end users.

STEP 1

The Query Server's BROKER is the mediator between the cluster and the end user. The BROKER reads the user query and distributes subqueries based on where the data is stored.

The BROKER talks to ZOOKEEPER to understand which segments live where. It is also responsible for merging the partial results before sending them back to the client.

STEP 2

The BROKER will talk to the processes under the MASTER SERVER to locate the data. The requested data can come from the HISTORICALS, which hold slightly older data loaded from deep storage (e.g. S3), or from the MIDDLE MANAGERS, which hold live, in-memory data.

Middle Managers are super fast because their data is in memory. In the case of Historicals, segments are initially loaded onto disk and then memory-mapped into cache.

Remember, Druid supports query caching for faster processing of future requests.
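
Caching can also be controlled per query through the query context. A sketch of a Druid SQL payload using the `useCache` and `populateCache` context flags (the datasource name is hypothetical):

```python
import json

# Per-query cache behaviour is set through the query "context":
#   useCache:      read previously cached per-segment results,
#   populateCache: store this query's per-segment results for reuse.
payload = {
    "query": "SELECT COUNT(*) AS cnt FROM ad_events",
    "context": {"useCache": True, "populateCache": True},
}

# POSTed to the Broker at /druid/v2/sql, like any other Druid SQL query.
print(json.dumps(payload))
```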

Bonus points

  1. Ingestion tasks are parallelizable, so we can increase the number of Middle Managers to increase the ingestion rate and parallelism.
  2. We can increase the number of Historical servers as you start storing more and more data.
  3. We can even define different sets of servers with higher or lower memory based on your needs.
  4. In the master servers, the Coordinator helps with data distribution, task coordination, load balancing, task state recording, and segment tracking.



Ajith Shetty

BigData Engineer — Love for Bigdata, Analytics, Cloud and Infrastructure.


