Big Data Query: MongoDB

In the previous articles, [Big Data Query: ClickHouse](https://xx/Big Data Query：ClickHouse) and [Big Data Query: Doris](https://xx/Big Data Query：Doris), we explored two widely used engines in the big data ecosystem, ClickHouse and Doris, and examined their architectures, design principles, and application scenarios.

ClickHouse and Doris are primarily designed for structured data. However, with the rapid growth of the Internet, mobile applications, and the Internet of Things (IoT), data has become increasingly complex and dynamic. Business models evolve faster than ever, and data structures are no longer fixed. Traditional relational databases struggle to handle such fluidity.

MongoDB was created to address these challenges.
In this article, we'll look at MongoDB from four angles: architecture design, data querying, core advantages, and application scenarios.

What Is MongoDB?

MongoDB is a document-oriented NoSQL database that stores data as BSON (Binary JSON), a binary encoding of JSON-like documents that adds types such as dates and binary data. It supports high-performance reads and writes and connects to the broader big data ecosystem through connectors for tools such as Spark and Kafka. This makes MongoDB a flexible and efficient option for large-scale, semi-structured, and dynamic data.

A document-oriented database differs from traditional relational databases in that it doesn’t require a predefined schema. Instead, data is stored in flexible JSON-like documents.
For example:

{
  "user_id": 1001,
  "name": "Bob",
  "tags": ["music", "travel"],
  "profile": {
    "age": 25,
    "country": "USA"
  }
}

In MongoDB, the equivalent of a “table” is a collection. Documents within the same collection can have different structures and fields, making this model ideal for rapidly changing business environments, such as e-commerce promotional campaigns.
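
To make the flexible schema concrete, the sketch below models a collection as a plain JavaScript array. This is only a stand-in for illustration, not how MongoDB stores data, and the second document with its `coupon` field is hypothetical.

```javascript
// A "collection" modeled as a plain array; real MongoDB stores BSON documents.
const users = [
  { user_id: 1001, name: "Bob", tags: ["music", "travel"], profile: { age: 25, country: "USA" } },
  // Hypothetical document added during a promotion: no profile, but a new field.
  { user_id: 1002, name: "Alice", coupon: { code: "SPRING", percent_off: 10 } },
];

// Optional chaining reads nested fields safely when a document lacks them.
const countries = users.map((u) => u.profile?.country ?? null);
const withCoupon = users.filter((u) => u.coupon !== undefined).map((u) => u.name);

console.log(countries);  // "USA" for Bob, null for Alice
console.log(withCoupon); // only Alice has a coupon
```

Because nothing forces the two documents to share fields, the application code, not the database, decides how to treat a missing field.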

As a document-oriented NoSQL database, MongoDB’s core features include:

Flexible Document Model
Native Distributed Architecture
Powerful Query Language

Architecture Design

MongoDB is designed to store data in a way that closely resembles application-layer structures while maintaining data consistency, reliability, and high scalability.

Its architecture consists of three main components:

mongod — The core database process responsible for data storage and querying.
mongos — The query router in a sharded cluster, responsible for request routing.
config server — Stores metadata for sharded clusters.

Two fundamental concepts underpin MongoDB’s architecture: Replica Sets and Sharding.

Replica Sets

MongoDB achieves high availability through replica sets.
A replica set contains one primary node and one or more secondary nodes. The primary handles all write operations, while the secondaries replicate the primary's changes to keep their copies of the data in sync.
If the primary fails, the remaining members hold an election and, provided a majority of voting members can reach each other, promote one of the secondaries to primary. This gives the cluster fault tolerance without manual intervention.
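
The majority rule behind failover can be sketched in a few lines. This is a simplified illustration, not MongoDB's election protocol, which also weighs factors such as member priority and how up to date each member is. The host names and the `canElectPrimary` helper are hypothetical.

```javascript
// Returns true when the reachable voting members form a strict majority,
// the basic precondition for electing a primary.
function canElectPrimary(members) {
  const voting = members.filter((m) => m.votes > 0);
  const reachable = voting.filter((m) => m.reachable);
  return reachable.length > voting.length / 2;
}

const replicaSet = [
  { host: "node-a:27017", votes: 1, reachable: false }, // failed primary
  { host: "node-b:27017", votes: 1, reachable: true },
  { host: "node-c:27017", votes: 1, reachable: true },
];

console.log(canElectPrimary(replicaSet)); // true: 2 of 3 voting members remain
```

With two of the three members down, the same function returns false, which is why replica sets are usually deployed with an odd number of voting members.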

Sharding

To maintain high performance on massive datasets, MongoDB supports sharding, enabling horizontal scalability.
Data is distributed across multiple nodes based on a shard key, allowing the system to process queries in parallel and maintain performance even as data volume grows.
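
As a rough sketch of range-based distribution, the snippet below maps a shard key value to the shard that owns its range. The chunk boundaries, shard names, and the `shardFor` helper are assumptions made for illustration. MongoDB also supports hashed shard keys, which this sketch does not cover.

```javascript
// Hypothetical chunk table: each chunk owns a half-open range [min, max) of user_id.
const chunks = [
  { min: -Infinity, max: 1000, shard: "shard-a" },
  { min: 1000, max: 2000, shard: "shard-b" },
  { min: 2000, max: Infinity, shard: "shard-c" },
];

// Find the shard whose chunk range contains the document's shard key value.
function shardFor(doc, shardKey = "user_id") {
  const value = doc[shardKey];
  const chunk = chunks.find((c) => value >= c.min && value < c.max);
  if (!chunk) throw new Error(`no chunk covers ${shardKey}=${value}`);
  return chunk.shard;
}

console.log(shardFor({ user_id: 1001, name: "Bob" })); // shard-b
```

A shard key whose values spread evenly across ranges keeps any single shard from becoming a hotspot.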

Data Querying

MongoDB's native query language is not SQL. Queries are themselves JSON-like documents, which keeps the syntax close to the data it describes.
The query language covers filtering, sorting, and aggregation, and queries can be accelerated with indexes.

Using the example document above, here are a few common query patterns:

Basic Query
Find all users from the USA: db.users.find({ "profile.country": "USA" })
Conditional and Range Query
Find users aged between 25 and 30, inclusive: db.users.find({ "profile.age": { "$gte": 25, "$lte": 30 } })
Aggregation Query
Count the number of users per country:
db.users.aggregate([
  { $group: { _id: "$profile.country", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])
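
To show how the find filters above are interpreted, here is a minimal matcher in plain JavaScript. It is a sketch covering only dot-notation paths, equality, and four comparison operators, not MongoDB's actual matching logic (which, for example, also handles arrays). The helpers `getPath` and `matches` and the second user are hypothetical, and the range example uses inclusive bounds.

```javascript
// Resolve a dot-notation path such as "profile.age" against a document.
function getPath(doc, path) {
  return path.split(".").reduce((value, key) => (value == null ? undefined : value[key]), doc);
}

// Comparison operators supported by this sketch (a small subset of MongoDB's).
const operators = {
  $gt: (a, b) => a > b,
  $gte: (a, b) => a >= b,
  $lt: (a, b) => a < b,
  $lte: (a, b) => a <= b,
};

// True when the document satisfies every condition in the filter.
function matches(doc, filter) {
  return Object.entries(filter).every(([path, condition]) => {
    const value = getPath(doc, path);
    if (condition !== null && typeof condition === "object") {
      return Object.entries(condition).every(([op, operand]) => {
        const compare = operators[op];
        if (!compare) throw new Error(`unsupported operator ${op}`);
        return compare(value, operand);
      });
    }
    return value === condition; // plain equality
  });
}

const users = [
  { user_id: 1001, name: "Bob", profile: { age: 25, country: "USA" } },
  { user_id: 1002, name: "Eve", profile: { age: 41, country: "France" } }, // hypothetical
];

const fromUsa = users.filter((u) => matches(u, { "profile.country": "USA" }));
const aged25to30 = users.filter((u) => matches(u, { "profile.age": { $gte: 25, $lte: 30 } }));
console.log(fromUsa.map((u) => u.name), aged25to30.map((u) => u.name));
```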

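The Aggregation Pipeline passes documents through a sequence of stages, each transforming the output of the previous one, much like the analytical queries SQL expresses with GROUP BY and ORDER BY. This lets MongoDB run many analytical computations in place, although dedicated OLAP engines are generally better suited to heavy scans over very large datasets.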
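
The staged model can be imitated in plain JavaScript by treating each stage as a function from an array of documents to a new array. This is an illustrative sketch rather than MongoDB's execution engine. The stage helpers `groupCountBy` and `sortDescBy`, the `runPipeline` function, and the extra users are hypothetical.

```javascript
// Stage factory: count documents per key, similar in spirit to $group with $sum: 1.
const groupCountBy = (keyFn) => (docs) => {
  const counts = new Map();
  for (const doc of docs) {
    const key = keyFn(doc);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts].map(([_id, count]) => ({ _id, count }));
};

// Stage factory: sort descending by a numeric field, similar to $sort with -1.
const sortDescBy = (field) => (docs) => [...docs].sort((a, b) => b[field] - a[field]);

// Run the stages in order, feeding each stage's output into the next.
const runPipeline = (docs, stages) => stages.reduce((current, stage) => stage(current), docs);

const users = [
  { name: "Bob", profile: { country: "USA" } },
  { name: "Ann", profile: { country: "USA" } },    // hypothetical
  { name: "Luc", profile: { country: "France" } }, // hypothetical
];

const perCountry = runPipeline(users, [
  groupCountBy((u) => u.profile.country),
  sortDescBy("count"),
]);
console.log(perCountry); // USA with count 2 first, then France with count 1
```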

Core Advantages

MongoDB provides high-performance querying for document data through a series of architectural and engine-level optimizations:

Memory and Caching Mechanism
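MongoDB's default storage engine, WiredTiger, keeps frequently accessed data in an in-memory cache and evicts roughly the least recently used (LRU) pages when the cache fills. It also compresses data on disk, which reduces I/O and storage footprint and improves query speed.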
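
The LRU idea itself is easy to sketch. The class below is a generic textbook-style LRU cache written for illustration, not the storage engine's actual eviction code. The `LruCache` name and the page keys are hypothetical.

```javascript
// Minimal LRU cache built on Map, which preserves insertion order.
class LruCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.entries = new Map();
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    // Re-insert so the key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }
}

const cache = new LruCache(2);
cache.set("page-1", "A");
cache.set("page-2", "B");
cache.get("page-1");      // touch page-1 so page-2 becomes the oldest
cache.set("page-3", "C"); // evicts page-2
console.log([...cache.entries.keys()]); // page-1 and page-3 remain
```
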
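Batch-Oriented Execution
Unlike ClickHouse, MongoDB's general-purpose query engine is not a SIMD-vectorized columnar engine; it is optimized for document-at-a-time operational access. Recent releases narrow the gap for some workloads by processing data in batches rather than one document at a time, for example for time-series collections.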
Flexible Indexing
MongoDB offers a wide range of indexing strategies:
single-field, compound, text, geospatial, and TTL indexes.
These allow MongoDB to handle both structured and semi-structured data efficiently.
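
The index types listed above map onto the `(keys, options)` arguments of the shell's `createIndex` method. The sketch below builds those arguments as plain objects and prints the corresponding shell commands. The fields `location` and `created_at` and the `toShellCommand` helper are hypothetical, chosen only to illustrate each index type.

```javascript
// One illustrative definition per index type mentioned in the text.
const indexSpecs = [
  { kind: "single-field", keys: { user_id: 1 }, options: {} },
  { kind: "compound", keys: { "profile.country": 1, "profile.age": -1 }, options: {} },
  { kind: "text", keys: { name: "text" }, options: {} },
  { kind: "geospatial", keys: { location: "2dsphere" }, options: {} },
  // TTL: documents expire one hour after the date stored in created_at.
  { kind: "TTL", keys: { created_at: 1 }, options: { expireAfterSeconds: 3600 } },
];

// Render a spec as the command you would type in the MongoDB shell.
function toShellCommand(collection, { keys, options }) {
  return `db.${collection}.createIndex(${JSON.stringify(keys)}, ${JSON.stringify(options)})`;
}

for (const spec of indexSpecs) {
  console.log(`${spec.kind}: ${toShellCommand("users", spec)}`);
}
```
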
Distributed Query Execution
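In a sharded cluster, mongos routes each query using the shard key. A query that includes the shard key is sent only to the shards that own the matching data; a query without it is broadcast to every shard and the results are merged. Choosing a shard key that matches common query filters therefore minimizes full-cluster scans.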
Parallel Aggregation
In a sharded cluster, the Aggregation Pipeline can run its early stages on each shard in parallel and then merge the partial results, which extends MongoDB to more analytical use cases.
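
The split between per-shard work and a final merge can be sketched as follows. This is a simplified, synchronous illustration of the scatter-gather idea, not MongoDB's implementation. The shard contents and the helpers `partialCounts` and `mergeCounts` are hypothetical.

```javascript
// Hypothetical per-shard data; in a real cluster each shard holds part of the collection.
const shards = {
  "shard-a": [{ country: "USA" }, { country: "France" }],
  "shard-b": [{ country: "USA" }, { country: "USA" }, { country: "Japan" }],
};

// Step 1: each shard counts its own documents per country, independently of the others.
function partialCounts(docs) {
  const counts = {};
  for (const { country } of docs) counts[country] = (counts[country] ?? 0) + 1;
  return counts;
}

// Step 2: the router merges the partial results into global totals.
function mergeCounts(partials) {
  const totals = {};
  for (const partial of partials) {
    for (const [country, count] of Object.entries(partial)) {
      totals[country] = (totals[country] ?? 0) + count;
    }
  }
  return totals;
}

const totals = mergeCounts(Object.values(shards).map(partialCounts));
console.log(totals); // USA: 3, France: 1, Japan: 1
```

Counting works well here because partial counts can simply be added; aggregations that cannot be combined this way need more work at the merge step.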

Application Scenarios

MongoDB is widely adopted in production environments for its versatility. Common use cases include:

Log Querying – Store logs in JSON format for flexible and fast content-based queries.
Content Management and Search – Combine with full-text indexing to enable content retrieval and search capabilities.
IoT Data Analysis – Store heterogeneous sensor data in JSON documents and perform real-time queries and analytics efficiently.

Conclusion

MongoDB redefined how we think about databases by moving beyond rigid, structured storage toward a document-centric model.
Its flexible schema, intuitive query syntax, and distributed scalability make it ideal for managing complex, rapidly changing data.

Unlike OLAP systems such as ClickHouse or Doris, MongoDB is primarily an operational database built around semi-structured documents, and it continues to evolve to narrow the gap between operational and analytical workloads in the modern data landscape.