
Big Data Query: Doris

Doris delivers sub-second queries on billion-row datasets with MySQL compatibility. Key features:
• Minimalist MPP architecture (FE/BE nodes)
• Vectorized engine and columnar storage
• Real-time ingestion (Kafka/Flink/HTTP)
• Single-system simplicity (no Hadoop/ZooKeeper)
Ideal for BI dashboards, user analytics, and live data warehousing.

2025-10-20

In the previous article, [Big Data Query: Turning Data into Decisions](https://xx/Big Data Query:Turning Data into Decisions), we discussed how data only generates value when it is continuously used, and how that continuous use places new demands on query capabilities.
In [Big Data Query: ClickHouse](https://xx/Big Data Query:ClickHouse), we explored ClickHouse, a performance-focused query engine.
This article introduces another performance-oriented engine: Doris.

In real-world production environments, performance is not the only factor that matters. Ease of use and maintainability are equally important in choosing a technical solution.
Enterprises not only expect high performance to support large-scale, fast queries, but also prefer architectures that are simple, low-cost to develop, and easy to maintain.

Doris was born to meet these exact needs. With its minimalist architecture and high-performance real-time querying capabilities, it has quickly become a popular query engine among enterprises.

This article explores Doris from four aspects: technical principles, architecture design, core advantages, and application scenarios.

What Is Doris?

Doris originated as Palo, an internal Baidu project, and was later donated to the Apache Software Foundation, where it graduated from the Apache Incubator to become the top-level project Apache Doris.
It is more than just an OLAP engine: Doris integrates data lake capabilities and real-time analytical processing, making it a high-performance, real-time MPP (Massively Parallel Processing) analytical database.

Guided by the principles of extreme speed and minimalist design, Doris stands out for its simple architecture, easy deployment, and strong real-time capabilities, making it suitable for diverse scenarios such as log analysis, BI reporting, user profiling, and monitoring analytics.

Technical Principles

Doris is designed to support both high-throughput real-time ingestion and sub-second multi-dimensional querying within a single system. Achieving this balance requires multiple technologies working together. Its core technical principles include:

  1. Columnar Storage
    Like other analytical databases, Doris uses column-oriented storage, which stores data from the same column together. This allows only the required columns to be read during queries, dramatically reducing I/O costs and improving efficiency for large-scale analytical workloads.
  2. Vectorized Execution
    Vectorized execution processes data in batches of column values rather than one row at a time, which lets the engine use CPU SIMD (Single Instruction, Multiple Data) instructions and reduces per-row overhead.
    Since version 1.2, Doris has fully adopted a vectorized execution engine, so operators work on many rows per call, which greatly accelerates query execution.
  3. Primary Key and Aggregation Models
    Doris supports multiple table models to meet different storage and query requirements, enabling both detailed analysis and real-time metric aggregation:
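    • Duplicate Key – Stores every row as written, including rows with identical keys (similar to a table without a primary key in an RDBMS). Ideal for raw log data.
    • Aggregate Key – Merges rows that share the same key by applying a per-column aggregation function (such as SUM, MAX, or MIN). Suitable for analytical summaries.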
    • Unique Key – Maintains unique keys, allowing updates by primary key. Ideal for real-time data synchronization.
  4. Real-Time Ingestion
    Unlike traditional OLAP systems, Doris supports high-concurrency real-time ingestion and offers several ingestion methods, including Kafka streaming (Routine Load), Flink integration, and HTTP API ingestion (Stream Load).
  5. Flexible Storage
    Doris supports both co-located storage and compute (to minimize data transfer overhead) and storage-compute separation, which reduces storage costs while maintaining query efficiency.
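
The difference between the three table models in item 3 is easiest to see on a small example. The following is a minimal Python sketch that only simulates the merge semantics in memory; it is not Doris code, and the function names, column names, and sample rows are all hypothetical. It assumes the Unique Key model behaves as "last write wins" for rows with the same key.

```python
import operator

def duplicate_key(rows):
    """Duplicate Key semantics: keep every row, including repeated keys."""
    return [dict(r) for r in rows]

def aggregate_key(rows, key_cols, agg_funcs):
    """Aggregate Key semantics: merge rows with the same key,
    combining each value column with its aggregation function."""
    merged = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in merged:
            merged[key] = dict(row)
        else:
            for col, fn in agg_funcs.items():
                merged[key][col] = fn(merged[key][col], row[col])
    return list(merged.values())

def unique_key(rows, key_cols):
    """Unique Key semantics (assumed last-write-wins): a later row
    with the same key replaces the earlier one."""
    latest = {}
    for row in rows:
        latest[tuple(row[c] for c in key_cols)] = dict(row)
    return list(latest.values())

# Hypothetical click events; the first two rows share the same key.
rows = [
    {"user_id": 1, "day": "2025-10-01", "clicks": 3},
    {"user_id": 1, "day": "2025-10-01", "clicks": 5},
    {"user_id": 2, "day": "2025-10-01", "clicks": 1},
]
keys = ("user_id", "day")

dup_rows = duplicate_key(rows)
agg_rows = aggregate_key(rows, keys, {"clicks": operator.add})  # SUM-like
uniq_rows = unique_key(rows, keys)

print(len(dup_rows), agg_rows[0]["clicks"], uniq_rows[0]["clicks"])  # 3 8 5
```

The same three input rows yield three stored rows under Duplicate Key, a summed value of 8 under Aggregate Key, and the most recent value of 5 under Unique Key, which is why the choice of model depends on whether the table holds raw events, pre-aggregated metrics, or mutable records.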

Architecture Design

Doris adopts a minimalist MPP architecture, consisting of two core components: FE (Frontend) and BE (Backend).
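
To make the MPP idea concrete, here is a toy Python sketch of scatter-gather aggregation. It assumes the usual division of labor in which FE nodes accept client connections, hold metadata, and plan queries, while BE nodes store data and execute query fragments. Everything in it (the shard layout, `be_partial_sum`, `coordinate_query`) is hypothetical and runs in one process with threads standing in for BE nodes. For brevity the final merge happens in the coordinator function; a real engine may place that step elsewhere in the distributed plan.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data shards, one per BE node: (country, clicks) rows.
be_shards = [
    [("cn", 3), ("us", 2)],
    [("us", 4), ("de", 1)],
    [("cn", 5)],
]

def be_partial_sum(shard):
    """Stand-in for a BE node: aggregate only the rows stored locally."""
    partial = Counter()
    for country, clicks in shard:
        partial[country] += clicks
    return partial

def coordinate_query(shards):
    """Stand-in for the coordinator: send the same fragment to every
    shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(be_partial_sum, shards))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts per key
    return dict(total)

# Roughly: SELECT country, SUM(clicks) FROM events GROUP BY country
result = coordinate_query(be_shards)
print(result)
```

Because each worker reduces its own shard before anything is combined, only small partial results need to be exchanged, which is the main reason this pattern scales by adding nodes.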

Its architecture is illustrated below:

This design is simple yet powerful, with notable characteristics:

Core Advantages

Doris has rapidly become a mainstream big data query engine due to the following advantages:

Application Scenarios

Doris is widely used across industries for various big data analytics use cases, including:

Conclusion

In the evolution of big data query technologies, Doris embodies the concept of stream-batch unification, making real-time data warehousing practical and easy to implement.
With ongoing community development, Doris is also progressing toward lakehouse integration, further expanding its role in the modern data ecosystem.