Feb. 7, 2025

Real-time analysis Part 2: A Closer Look at Apache Druid

Dima Baranetskyi

Data Engineer

In Part 1, we embarked on a journey through the landscape of real-time analytics databases. Now, let's dive deeper into what I find most fascinating about Apache Druid. If you are still exploring, I suggest reading Part 1 first. If you are already considering Apache Druid and want to understand it from a technical point of view, this article may help you make the final decision.

Is Apache Druid the right tool for my use case?

“We become what we behold. We shape our tools, and thereafter our tools shape us.”
Marshall McLuhan

Suppose your use case does not fit your current technical landscape. You want data analysed in or near real time. You want high availability and concurrency with sub-second queries, at the lowest cost, at any scale.

All these capabilities are present in the Druid toolset. But what does it not promise?

  • Apache Druid is not a replacement for your Data Warehouse tools. It complements them for specific use cases.
  • Apache Druid is not a replacement for your search systems.
  • Apache Druid is not a replacement for your timeseries solution. For complex timeseries operations, dedicated timeseries databases significantly outperform Druid, though Druid can efficiently handle basic timeseries tasks.

A crucial consideration — Druid’s impressive capabilities come with their own set of trade-offs:

  • Druid is fundamentally an event-oriented system, not an entity-oriented system. This means every row MUST have a timestamp, which unifies dimensions about entities (like browsers, locations, or actions) into an immutable record of what was true at that moment in time. This is not just a technical requirement — it shapes how Druid stores, manages, and queries data.
  • Approximation. This feature uses the Apache DataSketches library, which offers excellent memory efficiency and performance for high-cardinality aggregations at the cost of absolute precision; accuracy typically ranges within ±2%, depending on the sketch type and configuration. Keep in mind that financial transactions, healthcare measurements, and legal evidence systems require exact counts for regulatory compliance, where even small deviations could have significant monetary or life-critical consequences. While approximation can be disabled, doing so may significantly reduce performance for these types of calculations.
  • Apache Druid uses column-oriented storage. While excellent for analytical queries, it is inefficient for row-level operations and single-record lookups compared to traditional databases.
  • Relatively complex cluster topology. Single-server deployment offers simplicity in management and lower operational costs but lacks high availability, scalability, and specialised node optimisation. Multi-server deployment with specialised nodes (Router, Broker, Historical, MiddleManager, Coordinator, Overlord) and their external dependencies provides better performance, availability, and scalability at the cost of increased operational complexity and resource requirements. In the vast majority of deployments, Druid runs clustered, not as a single server.

Still have doubts and questions? Good news! There are several ways to explore Apache Druid:

  • Try it yourself — there is a GitHub repository maintained by members of the community. It includes a local setup using Docker and multiple examples that highlight Druid's capabilities
  • People behind Apache Druid — Imply, the main contributor to the project. You can easily get in touch with advisers from Imply and get answers to all your questions
  • Learn Apache Druid — there are several free courses that not only come with certificates of completion but also let you dive into Druid's internals
  • Apache Druid Community — I found Druid's Slack very active and friendly. People ask questions at different levels, and, most importantly, they get answers

Well, if you've reached this point in the article, you most probably already know whether you are still interested. And if the answer is "yes", let's explore further!

What kind of software is Apache Druid?

“Any sufficiently advanced technology is indistinguishable from magic.”
Arthur C. Clarke

Apache Druid, in its essence, is a Java application built on top of other Java solutions. Several of them are worth mentioning: Apache ZooKeeper, Apache Calcite, and Apache DataSketches.

Let's explore them one at a time.

Apache Zookeeper

While some major players in the market are removing ZooKeeper as a dependency, Apache Druid still relies on it for critical coordination services:

  • Coordinator leader election (using Curator LeaderLatch recipe)
  • Process discovery via announcements path
  • Overlord leader election
  • Overlord and Middle Manager task management

Though ZooKeeper acts as a central dependency, Druid is designed to continue serving queries using its last known good state during ZooKeeper unavailability. This “last known good state” principle means each node caches its latest configuration from ZooKeeper locally, making Druid resilient for read operations while management operations may be impacted. Note: As of Druid 31.0.0, segment loading is no longer dependent on ZooKeeper, representing a significant reduction in ZooKeeper’s role in the architecture.

Apache Calcite

Calcite provides SQL parsing, optimisation, and query planning capabilities, allowing Druid to support SQL queries and complex transformations, while Druid enhances it with custom rules and optimisations for time-series data. This integration relies on two key components: Calcite's core query processing and Avatica, its framework for building database drivers. Avatica provides the wire API between clients and servers, supporting both JSON and Protocol Buffers protocols, which enables Druid's SQL interface through JDBC drivers and HTTP APIs.

However, this integration has limitations: full SQL syntax is not supported (complex subqueries and some join types are missing), and better performance sometimes requires falling back to native Druid queries. Calcite-based SQL in Druid lacks certain SQL features (window functions are limited, deeply nested subqueries may not work), carries a performance overhead compared to native queries in some cases, and its query optimiser may not always choose the most efficient execution plan for Druid's specific storage and processing model, particularly for time-based operations where native queries can perform better. While Avatica's flexibility allows for various client implementations, the performance considerations remain. I highly suggest reading the official Calcite and Avatica documentation for a better understanding.

Apache DataSketches

DataSketches comes as a default extension in Druid, providing efficient approximate algorithms for distinct counting, quantile calculation, and set operations that scale well with large data volumes and support streaming ingestion. While results are probabilistic with a typical error margin of ±2%, the approach significantly reduces storage requirements compared to alternatives. For example, using a sketch to track unique IP addresses periodically requires much less storage than keeping raw IP addresses in a rollup table. However, due to the probabilistic nature of the results, use cases requiring exact precision (like financial calculations or legal compliance) should carefully evaluate if this approach is suitable. The error bounds are well-defined and configurable through sketch size parameters.
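To make this concrete, here is a minimal sketch of what such an approximate distinct count looks like through Druid's SQL API. The datasource name (`web_events`) and column (`ip_address`) are hypothetical; `APPROX_COUNT_DISTINCT_DS_HLL` is the HLL sketch aggregation shipped with the druid-datasketches extension.

```python
import json

# Build the request body for Druid's SQL HTTP API.
# Datasource and column names below are placeholders for illustration.
payload = {
    "query": (
        "SELECT TIME_FLOOR(__time, 'PT1H') AS hour, "
        "APPROX_COUNT_DISTINCT_DS_HLL(ip_address) AS unique_ips "
        "FROM web_events GROUP BY 1 ORDER BY 1"
    ),
    "resultFormat": "object",
}

# This body would be POSTed to the Broker's SQL endpoint,
# e.g. POST http://<broker>:8082/druid/v2/sql
body = json.dumps(payload)
```

Swapping the sketch function for a plain `COUNT(DISTINCT ip_address)` gives exact results at the cost of the memory and performance trade-offs described above.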

Java world

What if your organisation lacks Java expertise? Well, core operations like querying, ingestion, and basic configuration can be managed through REST APIs, SQL, and JSON configurations without Java knowledge, but troubleshooting performance issues, handling production incidents, and implementing custom extensions will require Java expertise for JVM tuning and understanding internal processing.

At minimum, some Java expertise should be acquired to handle:

  • JVM memory management, particularly for avoiding Out of Memory (OOM) errors during ingestion by properly sizing incoming data splits
  • Basic Java thread analysis for troubleshooting deadlocks and performance issues
  • Reading Java stack traces for effective incident response

The ability to write expert-level Java code is not required, but understanding these operational aspects is crucial for maintaining a healthy Druid cluster.

Native Ingestion capabilities

“Men have become the tools of their tools”
Henry David Thoreau

Apache Druid offers flexible data ingestion from various sources including cloud storage (S3, GCS, Azure), HDFS, HTTP endpoints, and local files. It supports both batch and streaming ingestion, with capabilities for parallel processing, authentication, and data filtering. The system can read directly from existing Druid segments and modern table formats like Delta Lake and Iceberg. Key features include parallel task execution, multiple authentication methods, and configurable performance settings.

Amazon Kinesis Integration

  • Exactly-once ingestion guarantees using shard and sequence number mechanisms
  • Supervisor oversight for managing task lifecycles and failures
  • Configurable fetch settings with throughput controls
  • Built-in AWS authentication support
  • Extensible data formats including JSON, CSV, Avro, and Protobuf

Apache Kafka Integration

  • Exactly-once ingestion using partition and offset mechanisms
  • Support for Kafka 0.11.x and higher versions
  • Flexible topic pattern matching for multi-topic ingestion
  • Comprehensive metadata parsing capabilities
  • Configurable consumer properties
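As a rough sketch, Kafka ingestion is configured through a supervisor spec submitted to the Overlord. The topic, datasource, column names, and broker addresses below are placeholders; the overall layout (dataSchema / ioConfig / tuningConfig) follows the Kafka ingestion spec.

```python
import json

# Hypothetical Kafka supervisor spec; names and addresses are placeholders.
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "clickstream",
            "timestampSpec": {"column": "ts", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "page", "country"]},
            "granularitySpec": {
                "segmentGranularity": "hour",
                "queryGranularity": "minute",
            },
        },
        "ioConfig": {
            "topic": "clickstream-events",
            "inputFormat": {"type": "json"},
            "consumerProperties": {
                "bootstrap.servers": "kafka-1:9092,kafka-2:9092"
            },
            "taskCount": 2,
            "useEarliestOffset": True,
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Submitted to the Overlord, e.g.
# POST http://<overlord>:8081/druid/indexer/v1/supervisor
body = json.dumps(supervisor_spec)
```

The supervisor then manages the ingestion task lifecycle, including the exactly-once offset bookkeeping mentioned above.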

Amazon S3 Integration

  • Native S3A filesystem support with optimized performance
  • Built-in data compression handling (gzip, bz2, zst)
  • Path-based data discovery and filtering
  • Parallel data ingestion capabilities
  • Support for AWS IAM authentication and encryption

Google Cloud Storage Integration

  • Native GCS connector support
  • Built-in authentication via service accounts
  • Path prefix and wildcard filtering
  • Parallel read capabilities
  • Automatic retry and error handling

Azure Blob Storage Integration

  • Native Azure Blob storage support
  • Managed identity authentication
  • Container and path-based filtering
  • Parallel download capabilities
  • Built-in retry mechanisms

HDFS Integration

  • Support for various Hadoop distributions
  • Kerberos authentication integration
  • Directory and file pattern matching
  • Parallel data processing capabilities
  • Configurable input formats and compression

Local File System Integration

  • Direct file system access for batch ingestion
  • Pattern-based file monitoring
  • Support for compressed file formats
  • Directory recursion capabilities
  • Configurable file completion detection
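For batch sources like the local file system, the equivalent of the supervisor spec is a one-off task spec. The sketch below assumes hypothetical paths and names; the structure follows the parallel batch (`index_parallel`) ingestion spec.

```python
# Hedged sketch of a parallel batch ingestion spec reading local JSON files.
# All paths, datasource, and column names are placeholders.
task_spec = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "daily_sales",
            "timestampSpec": {"column": "sold_at", "format": "auto"},
            "dimensionsSpec": {"dimensions": ["store", "sku"]},
            "granularitySpec": {"segmentGranularity": "day"},
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "local",
                "baseDir": "/data/sales",
                "filter": "*.json",
            },
            "inputFormat": {"type": "json"},
        },
        # maxNumConcurrentSubTasks controls the parallelism of sub-tasks.
        "tuningConfig": {"type": "index_parallel", "maxNumConcurrentSubTasks": 4},
    },
}
# Submitted to the Overlord as POST /druid/indexer/v1/task
```

Swapping the `inputSource` for an `s3`, `google`, `azure`, or `hdfs` variant is what switches between the integrations listed above.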

HTTP Input Source

  • RESTful endpoint data ingestion
  • Custom header and authentication support
  • Configurable retry mechanisms
  • Response validation capabilities
  • Support for various data formats

Delta Lake Integration

  • Time-travel query capabilities
  • Schema evolution support
  • Transaction log parsing
  • Partition pruning optimization
  • Metadata handling capabilities

Apache Iceberg Integration

  • Table format compatibility
  • Schema evolution support
  • Snapshot isolation capabilities
  • Partition layout handling
  • Metadata catalog integration

This comprehensive set of ingestion capabilities makes Apache Druid a versatile solution for modern data architectures. The platform’s ability to handle both batch and streaming data, coupled with its extensive support for various data sources and formats, enables organizations to build robust, scalable data pipelines. The built-in security features, performance optimizations, and failure handling mechanisms ensure reliable data ingestion in production environments. Whether dealing with cloud-native applications, on-premises systems, or hybrid architectures, Druid’s ingestion framework provides the flexibility and reliability needed for real-time analytics and data processing at scale.

Understanding Apache Druid’s Architecture

“Architecture should speak of its time and place, but yearn for timelessness.”
Frank Gehry

Before we delve into Druid’s internals, think of its architecture as a well-orchestrated symphony. Each component plays its unique part, creating a harmonious whole that’s greater than the sum of its parts.

The magic of Druid’s performance lies in its thoughtfully designed distributed architecture. Let’s dive into how Druid manages to handle massive amounts of data while maintaining its lightning-fast query response times.

The Big Picture

Druid’s architecture is built with cloud deployments in mind, emphasizing fault tolerance and operational flexibility. Think of it as a well-orchestrated team where each member has a specific role but can work independently. The beauty of this design is that if one component faces issues, it doesn’t bring down the entire system — a true demonstration of resilient architecture.

Core Services: The Building Blocks

Druid’s architecture consists of several specialized services, each with a distinct responsibility:

Master Services (The Coordinators)

  • The Coordinator service manages data availability and distribution across the cluster
  • The Overlord service controls data ingestion workloads and task assignments
  • For smaller deployments, these can run as a single service, but larger clusters might benefit from separating them for better resource management

Query Services (The Customer-Facing Layer)

  • The Broker service handles external client queries, acting as the primary entry point
  • The Router service provides a unified API gateway, directing traffic to appropriate services
  • This layer also includes a web console for cluster management and monitoring

Data Services (The Workhorses)

  • Historical services store and process queryable data, managing the heavy lifting for data retrieval
  • Middle Managers handle data ingestion with their Peon workers (alternatively, you can use either the Indexer service or the experimental MM-less approach which uses Kubernetes to manage tasks directly as jobs instead of using Middle Managers)
  • These services can be deployed separately for high-performance clusters to avoid resource contention

The Storage Symphony

Druid’s storage architecture is particularly fascinating, operating across multiple layers:

  1. Deep Storage: Think of this as Druid’s persistent backup system. Whether you’re using S3, HDFS, or a local filesystem, this layer stores all ingested data and serves as the source of truth. This was a deliberate design choice to leverage existing BLOB storage systems that already handle encryption, massively-parallel reads/writes, and high availability — rather than reinventing these capabilities. This means even if all data servers fail, Druid can reliably rebuild from deep storage using these battle-tested storage solutions.
  2. Historical Storage: This is where the performance magic happens. Historical nodes manage segments that can be cached on local disk and loaded into memory on demand for synchronous query performance. While Druid offers flexibility through its asynchronous API (MSQ) to access segments directly from deep storage, and provides tiering options for balancing performance and resource usage, some level of caching is required for the table to be queryable through the synchronous API.
  3. Metadata Storage: Usually a traditional RDBMS like PostgreSQL or MySQL, this layer keeps track of cluster metadata, segment information, and task management details.

Real-time Processing Flow

When data flows through Druid, it follows an elegant path:

  1. Ingestion tasks (running on Middle Managers or Indexers) consume data from sources
  2. Data is processed and stored in segments
  3. These segments are first stored in deep storage
  4. Historical nodes load segments according to retention rules
  5. Brokers coordinate query execution across the cluster

Modern Architectural Considerations

Druid’s architecture shines particularly bright in modern cloud deployments:

  • Kubernetes Ready: The architecture naturally fits container orchestration patterns
  • Scalability: Each service can be scaled independently based on workload
  • Resource Optimization: The ability to separate services allows for efficient resource utilization
  • Operational Flexibility: The architecture supports both simple single-server deployments and complex distributed clusters

External Dependencies: The Support System

Druid relies on three external systems that complement its architecture:

  • ZooKeeper for service discovery and coordination
  • Deep storage (like S3 or HDFS) for data persistence
  • Metadata storage (like PostgreSQL) for system metadata
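Wiring up these three dependencies happens in Druid's common runtime properties. A minimal sketch, assuming hypothetical hostnames and bucket names (ZooKeeper ensemble, PostgreSQL metadata store, S3 deep storage):

```properties
# ZooKeeper ensemble for coordination and service discovery
druid.zk.service.host=zk-1:2181,zk-2:2181,zk-3:2181

# Metadata storage (PostgreSQL in this sketch)
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://pg-host:5432/druid
druid.metadata.storage.connector.user=druid

# Deep storage (S3 in this sketch)
druid.storage.type=s3
druid.storage.bucket=my-druid-deep-storage
druid.storage.baseKey=segments
```

Each of the three can be swapped independently (e.g. MySQL for metadata, HDFS or local disk for deep storage) without touching the others.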

What makes this architecture particularly elegant is how it balances complexity with practicality. Each component has a clear responsibility, making the system both maintainable and truly elastic. For example, you can scale ingestion capacity by dynamically adding or removing Middle Manager and Peon nodes based on your workload — spinning them up during heavy ingestion periods and scaling them down afterwards. Similarly, query capacity can be adjusted by adding Historical nodes during high-demand periods and removing them when the demand decreases. This granular control over resources ensures you’re not just scalable in theory, but efficiently scalable in practice.

Deploying Apache Druid: Choosing Your Path

“It's not the Destination, It's the journey”
Ralph Waldo Emerson

Let’s explore the three main paths for deploying Apache Druid: Kubernetes Operator, Helm Charts, and managed services. Each approach offers distinct advantages and trade-offs that can significantly impact your real-time analytics journey.

Kubernetes Operator: The Native Path

The Druid Kubernetes Operator provides a native approach to running Druid on Kubernetes, offering fine-grained control over your deployment.

Key Configurations:

  • Supports both Deployments and StatefulSets for Druid nodes (defaulting to StatefulSets)
  • Node-specific configurations via nodeSpec
  • Common cluster configurations via clusterSpec
  • Flexible runtime properties management through ConfigMaps

Advantages:

  • Deep integration with Kubernetes native features
  • Fine-grained control over each component
  • Direct access to Kubernetes features like auto-scaling and monitoring
  • Easier integration with existing Kubernetes workflows

Trade-offs:

  • Requires Kubernetes expertise
  • More manual configuration needed
  • Higher operational overhead
  • Steeper learning curve for teams new to Kubernetes

Helm Chart: The Streamlined Path

Helm Charts provide a more packaged approach, making it easier to get started while still maintaining flexibility.

Key Configurations:

  • Services configuration (Router, Broker, Coordinator, Overlord, Historical)
  • Support for different tiers of Historical nodes (e.g., hot and cold)
  • Built-in worker categories
  • HPA (Horizontal Pod Autoscaling) with zero-scale capability
  • Integrated PostgreSQL and ZooKeeper management

Advantages:

  • Simpler deployment process
  • Pre-configured best practices
  • Easy version management and upgrades
  • No Kubernetes operator overhead
  • Flexible configuration through values.yaml

Trade-offs:

  • Less granular control compared to the operator
  • May need customization for specific use cases
  • Some advanced Kubernetes features might be harder to access
  • Dependencies on Helm ecosystem

Managed Service (Imply Polaris): The Hands-off Path

For teams looking to focus on analytics rather than infrastructure, managed services like Imply Polaris offer a compelling option.

Advantages:

  • Zero infrastructure management
  • Automated scaling and optimization
  • Built-in security and compliance (SOC 2, HIPAA)
  • Automatic backups and disaster recovery
  • Professional support and expertise
  • Quick start with minimal configuration

Trade-offs:

  • Higher cost for small deployments
  • Less control over infrastructure
  • Potential vendor lock-in
  • Limited customization options
  • Dependency on provider’s upgrade schedule

Making the Choice

Consider these factors when choosing your deployment path:

Team Expertise:

  • Kubernetes experts? → Operator
  • DevOps-capable? → Helm Chart
  • Focus on analytics? → Managed Service

Operational Overhead:

  • High control needs? → Operator
  • Balanced approach? → Helm Chart
  • Minimal ops? → Managed Service

Cost Considerations:

  • Infrastructure costs only → Operator/Helm
  • Willing to pay for convenience → Managed Service

Scale Requirements:

  • Small to medium → Any option
  • Large scale → Consider Operator for maximum control
  • Variable/unpredictable → Managed Service for easy scaling

The beauty of these options is that you’re not locked in — you can start with one approach and migrate to another as your needs evolve. Many teams start with a managed service to understand Druid better, then move to self-hosted solutions as they grow their expertise.

Extending Apache Druid Yourself: A Developer’s Guide

“Everything that is really great and inspiring is created by the individual who can labor in freedom.”
Albert Einstein

Real-time data processing capability is a crucial asset in today’s data-driven world. Apache Druid, with its flexible extension system, allows developers to enhance its functionality to meet specific needs. Let’s explore how you can extend Druid’s capabilities and contribute to its ecosystem.

Understanding Druid’s Extension System

At its core, Druid uses a module system that enables runtime addition of extensions. This system leverages Guice for dependency injection and manages the object graph of the Druid process. While it’s theoretically possible to modify almost anything through Guice bindings, there are several common extension points that developers typically focus on.

Key Extension Points

Deep Storage Implementation

If you need to integrate with a specific storage system, you can create a new deep storage implementation by extending:

  • org.apache.druid.segment.loading.DataSegment*
  • org.apache.druid.tasklogs.TaskLog*

Input Source and Format

For handling new data sources or formats, you can implement:

  • InputSource - Defines where input data is stored
  • InputEntity - Specifies how data can be read in parallel
  • InputFormat - Determines how your data is formatted
  • InputEntityReader - Handles the parsing of your data format

Query Capabilities

To add new query functionality, implement:

  • QueryRunnerFactory
  • QueryToolChest
  • Query

Aggregation Features

For custom analytics, extend:

  • AggregatorFactory
  • Aggregator
  • BufferAggregator

Extending Apache Druid allows you to customize its functionality to meet your specific needs. Whether you’re adding support for a new storage system, implementing custom query types, or creating new aggregators, Druid’s extension system provides a flexible foundation for enhancement.

Remember to contribute back to the community if you create something useful! The Druid ecosystem grows stronger with each contribution, and your extension might help others facing similar challenges.

Apache Druid SQL Overview

“Simplicity is the ultimate sophistication.”
Leonardo da Vinci

Apache Druid offers robust SQL support through its Druid SQL interface, which provides a powerful way to query data stored in Druid datasources. The SQL interface translates standard SQL queries into Druid’s native query language, making it accessible for users familiar with SQL while leveraging Druid’s high-performance query engine.

Core SQL Support

Druid SQL supports a comprehensive SELECT query structure with the following key components:

  • Full support for standard clauses like SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT
  • Advanced features including EXPLAIN PLAN for query analysis
  • Support for subqueries, joins, and table-level operations
  • Dynamic parameters using question mark (?) syntax for parameterized queries
  • Support for both single-value and array parameters

Unique Features

Some standout features of Druid SQL include:

PIVOT and UNPIVOT Operations

While still marked as experimental features, Druid supports both PIVOT and UNPIVOT operations. PIVOT helps transform row values into column headers, particularly useful for creating cross-tabulated views of data. UNPIVOT performs the reverse operation, converting columns back into rows.

UNNEST Capabilities

The UNNEST clause provides functionality for working with array-typed values, allowing you to expand array elements into individual rows. This is particularly useful when dealing with multi-value fields or complex data structures.

Dynamic Parameters

The system supports parameterized queries using question mark (?) placeholders, which can help improve query performance by reducing parsing overhead and allowing for query reuse. These parameters can represent single values or even entire arrays, making them versatile for various query patterns.
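A brief sketch of what a parameterized request to the SQL HTTP API looks like; the datasource and column names are placeholders. Parameters are matched to the `?` placeholders in order, each carrying an explicit SQL type.

```python
import json

# Parameterized Druid SQL request; table/column names are hypothetical.
payload = {
    "query": (
        "SELECT page, COUNT(*) AS views FROM web_events "
        "WHERE country = ? AND __time >= ? "
        "GROUP BY page LIMIT ?"
    ),
    "parameters": [
        {"type": "VARCHAR", "value": "NL"},
        {"type": "TIMESTAMP", "value": "2025-01-01 00:00:00"},
        {"type": "INTEGER", "value": 10},
    ],
}

# POSTed to /druid/v2/sql on a Broker or Router.
body = json.dumps(payload)
```

Because the query text stays identical across executions, Druid can reuse the parsed plan instead of re-parsing each variant.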

Performance Considerations

Druid SQL includes several features to optimize query performance:

  • LIMIT clause pushdown to data servers when possible
  • Support for explicit type casting to avoid type inference issues
  • Ability to use EXPLAIN PLAN to understand query execution
  • Native query translation for optimal performance

Best Practices

When working with Druid SQL, consider these recommendations:

  1. Use explicit column selection instead of SELECT * for better query stability
  2. Leverage dynamic parameters for frequently executed queries
  3. Cast string comparisons explicitly for optimal performance
  4. Use proper quotation for identifiers and string literals
  5. Understand the limitations of features marked as experimental

Integration Points

Druid SQL can be accessed through multiple interfaces:

  • HTTP POST APIs:
    - Synchronous (Interactive) API: For executing SELECT queries that require immediate results
    - Asynchronous (MSQ) API: Supports both SELECT and INSERT operations, ideal for longer-running queries
  • JDBC driver for traditional database connectivity
  • Various client libraries that support the Druid SQL interface

The SQL interface provides a familiar entry point to Druid’s powerful analytics capabilities while maintaining the performance advantages of its native query engine.

Understanding Apache Druid Native Queries

“In nature, nothing is perfect and everything is perfect. Trees can be contorted, bent in weird ways, and they're still beautiful.”
Alice Walker

While Druid SQL provides a familiar interface, understanding Druid’s native query language is essential for advanced usage and optimization. Native queries in Druid are JSON-based and offer low-level access to Druid’s powerful query engine.

Native Query Basics

Native queries in Druid are structured as JSON objects and are typically sent to Broker or Router processes. Here are the key aspects:

  • Queries are sent via HTTP POST requests
  • Content-Type and Accept headers can be either application/json or application/x-jackson-smile
  • Queries can be executed against Brokers, Routers, Historical processes, or even Peons running stream ingestion tasks

Query Types

Druid offers several specialized query types for different use cases:

Aggregation Queries

  • Timeseries: Optimized for time-based aggregations
  • TopN: Best for ranked results with dimensional grouping
  • GroupBy: Most flexible but potentially less optimized than specialized types
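To illustrate the shape of these JSON queries, here is a minimal native timeseries query; the datasource, interval, and metric names are placeholders.

```python
import json

# Native timeseries query; datasource and field names are hypothetical.
query = {
    "queryType": "timeseries",
    "dataSource": "web_events",
    "granularity": "hour",
    "intervals": ["2025-01-01/2025-01-02"],
    "aggregations": [
        {"type": "count", "name": "rows"},
        {"type": "longSum", "name": "bytes", "fieldName": "response_bytes"},
    ],
}

# Sent as POST /druid/v2 to a Broker or Router,
# with Content-Type: application/json
body = json.dumps(query)
```

Changing `queryType` to `topN` or `groupBy` (with the extra fields those types require) switches between the aggregation query types above.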

Metadata Queries

  • TimeBoundary: For determining data time ranges
  • SegmentMetadata: For segment-level information
  • DatasourceMetadata: For datasource metadata retrieval

Other Queries

  • Scan: For raw data access
  • Search: For text-based searches

Choosing the Right Query Type

When deciding which query type to use, consider these guidelines:

  1. For aggregation queries, prefer Timeseries or TopN when possible as they’re optimized for specific use cases
  2. Use GroupBy when you need more flexibility or when other query types don’t fit your needs
  3. Select specialized queries (like TimeBoundary or SegmentMetadata) for specific metadata needs

Query Management

Druid provides several features for managing queries:

Query Cancellation

  • Queries can be cancelled using their unique identifier
  • Cancellation is done via a DELETE request to the query endpoint
  • Format: DELETE /druid/v2/{queryId}

Error Handling

Druid provides structured error responses with:

  • HTTP status codes indicating the type of error
  • JSON response containing detailed error information
  • Error codes for specific failure scenarios
  • Host information where the error occurred

Performance Considerations

Native queries are designed to be:

  • Lightweight and fast to execute
  • Close to Druid’s internal computation model
  • Suitable for building complex visualizations through multiple targeted queries

For complex analysis or visualization needs, it’s often better to make multiple focused native queries rather than trying to accomplish everything in a single complex query.

This native query interface represents Druid’s fundamental query layer, providing direct access to Druid’s powerful querying capabilities while maintaining high performance and flexibility.

Conclusion: Your Real-Time Analytics Journey

“The journey of a thousand miles begins with one step.”
Lao Tzu

We’ve covered quite a bit of ground in our exploration of Apache Druid’s internals. From its thoughtfully designed architecture to its powerful extension capabilities, we’ve seen how Druid manages to deliver on its promise of real-time analytics at scale.

But remember — understanding a tool deeply is just the beginning. The real value comes from applying this knowledge to solve real-world problems. Let’s recap what we’ve learned:

  • Druid’s architecture balances complexity with practicality, giving us both power and flexibility
  • The deployment options (Kubernetes Operator, Helm Charts, managed services) each serve different needs and team capabilities
  • The extension system provides a clear path for customization when needed
  • SQL support brings familiarity while native queries offer power
  • Each feature comes with its trade-offs, and understanding them is key to success

I’ve tried to share not just the “what” but also the “why” behind Druid’s design choices. This understanding is crucial when you’re building systems that need to handle real-time analytics at scale.

Looking for inspiration? Remember those success stories we mentioned:

  • How Airbnb leverages Druid in its analytics system architecture
  • Netflix’s use of Druid ensures a high-quality streaming experience
  • Confluent’s journey of scaling Druid for real-time cloud analytics
  • Uber’s implementation for mobile app crash analytics
  • Stripe’s impressive handling of Black Friday-Cyber Monday transactions

Each of these stories demonstrates Druid’s capabilities in different contexts. Your use case might be different, but the principles and patterns we’ve explored remain relevant.

Remember: “In real-time analytics, the best time to start learning was yesterday. The second best time is now.”

Feel free to reach out if you want to discuss your specific use cases or share your experiences. The journey into real-time analytics is always more interesting when shared!

P.S.

Special thanks to the Apache Druid community members who provided technical review and corrections for this article.