Leading Cloud Computing Service Provider Chose Apache Doris + Iceberg for Hyperscale Data Lakehouse

A leading global cloud computing service provider chose Apache Doris and Apache Iceberg to upgrade its data platform into a flexible, efficient, low-cost data lakehouse. The solution handles reporting and BI, federated analysis, log storage and analysis, and high-concurrency analysis. With Apache Doris, the company has launched 20+ projects across 50+ clusters and 3,000+ nodes, holding over 15 petabytes of data.

How Apache Doris Addressed the Challenges

Previously, the company ran a separate data warehouse and data lake, which led to data silos, poor query performance, and high operational complexity.

[Figure: How Apache Doris addressed the challenges]

Doris mainly delivered:

Efficient lakehouse queries: Doris connects to the data lake and accesses its data efficiently for query acceleration, and results can be written back to Iceberg. This eliminates data silos and improves query efficiency.

Real-time queries: Real-time data streams can be imported directly into Doris tables, which suits sub-second analysis and queries.

Apache Doris's Applications in Core Scenarios

Real-Time Reporting & Multi-Dimensional Analytics

Doris supports real-time data analytics and ad-hoc queries:

[Figure: Real-Time Reporting & Multi-Dimensional Analytics]

Higher reporting efficiency: Doris resolves the instability and limited scalability of the original Impala+Redis architecture, improving reporting and data processing efficiency.
Sub-second queries: Doris improves data retrieval and query performance by replacing the Oracle database and PostgreSQL in the BI system, enabling real-time data analytics for business decision-making.
Unified technology stack: consolidating multiple technical components into Doris simplifies the architecture, reducing operational complexity.

Lakehouse Federated Analysis

Doris lakehouse architecture:

[Figure: Lakehouse Federated Analysis]

Doris directly accesses data in Iceberg tables for real-time queries, leveraging data caching, materialized views, and transparent acceleration for higher query performance. Apache Ranger provides unified management across multiple components, including Doris, Spark, Hive, and Iceberg. Combined with Doris's cross-source federated query capabilities, this lets the data lakehouse deliver consistent, high-performance real-time analytics.

Specifically, Doris enables query performance tuning and data write-back to Iceberg:

Query performance tuning:

Doris supports lazy materialization of complex types such as Array, Map, and Struct. With lazy materialization, Doris reads the columns used in filter conditions first, applies the filter, and then reads the remaining columns only for the rows that match. This reduced network I/O from hundreds of GB to hundreds of MB, making queries faster.
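The idea behind lazy materialization can be sketched in a few lines of Python. This is a toy illustration, not Doris's implementation: `ColumnStore`, `lazy_scan`, and the `values_read` counter are hypothetical names, and an in-memory dictionary stands in for column data that a real reader would fetch over the network.

```python
from typing import Callable, Dict, List


class ColumnStore:
    """Hypothetical in-memory columnar file; each column is a list of values."""

    def __init__(self, columns: Dict[str, List]):
        self.columns = columns
        self.values_read = 0  # number of values fetched, a stand-in for I/O

    def read_column(self, name: str) -> List:
        """Read every value of one column."""
        col = self.columns[name]
        self.values_read += len(col)
        return col

    def read_rows(self, name: str, row_ids: List[int]) -> List:
        """Read only the selected rows of one column."""
        col = self.columns[name]
        self.values_read += len(row_ids)
        return [col[i] for i in row_ids]


def lazy_scan(store: ColumnStore, predicate_col: str,
              predicate: Callable, output_cols: List[str]) -> Dict[str, List]:
    # Step 1: read only the filter column and find the matching row ids.
    pred_values = store.read_column(predicate_col)
    matching = [i for i, v in enumerate(pred_values) if predicate(v)]
    # Step 2: materialize the other (possibly large) columns for those rows only.
    result = {predicate_col: [pred_values[i] for i in matching]}
    for name in output_cols:
        result[name] = store.read_rows(name, matching)
    return result


# Made-up sample data: one scalar filter column and two complex-type columns.
store = ColumnStore({
    "status": [200, 500, 200, 404, 500],
    "tags": [["a"], ["b", "c"], [], ["d"], ["e"]],
    "attrs": [{"k": 1}, {"k": 2}, {"k": 3}, {"k": 4}, {"k": 5}],
})
rows = lazy_scan(store, "status", lambda s: s == 500, ["tags", "attrs"])
print(rows["tags"])       # [['b', 'c'], ['e']]
print(store.values_read)  # 9, versus 15 if all three columns were read in full
```

The saving grows with the selectivity of the filter and the width of the deferred columns, which is why complex types such as Array, Map, and Struct benefit most.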

[Figure: Query performance tuning]

Data write-back to Iceberg:

Users process data directly in Doris and then write the results back to Iceberg, using asynchronous materialized views for layered data processing. Because Iceberg provides shared storage, these results can be shared seamlessly with other Doris clusters.

[Figure: Data write-back to Iceberg]

Log Storage and Analysis

ELK (Elasticsearch + Logstash + Kibana) struggled with high costs, inefficient queries, and limited scalability when processing large-scale logs. Doris improves on this with the following ingestion paths:

[Figure: Log Storage and Analysis]

Large data volumes or critical data: stored in Kafka for traceability, then either processed by Flink or imported into Doris through Routine Load
Small data volumes (less than 10 billion/day): imported directly into Doris through OpenTelemetry

Doris achieved 5x higher write throughput, an 80% reduction in storage costs, sub-second log retrieval over 10 billion logs, and 3x faster queries.

Artificial Intelligence of Things (AIoT)

The AIoT platform is a real-time data analytics system. It handles 10,000+ concurrent data loads, tables with 10 billion records, and hundreds of millions of new records per day. Most queries return single-row results, so the workload demands high concurrency and low latency. Doris supports it with:

Intelligent partitioning & bucketing: Doris's partition pruning and bucketing parallelism distribute data evenly, narrow the data scanning range, and lower I/O costs for higher query performance.
Advanced indexing system: Doris supports a range of indexes, such as the BloomFilter index and the inverted index, to reduce I/O and accelerate queries.
Efficient batch loading: users can batch data (from several MB to several GB in size) on the client side before loading, avoiding the severe write amplification caused by high-frequency small loads. For high-concurrency loading of small data volumes, users can rely on server-side batching with Group Commit to lower loading costs.
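The client-side batching described in the last item can be sketched as follows. This is a minimal illustration under stated assumptions, not a Doris client: `ClientBatcher` and its `flush` callback are hypothetical names, rows are serialized as JSON lines purely for the example, and the 64-byte threshold is tiny on purpose so the behavior is visible (the text above suggests batches of several MB to several GB in practice).

```python
import json
from typing import Callable, Dict, List


class ClientBatcher:
    """Buffers rows and hands them over as one load once a size threshold is reached."""

    def __init__(self, flush: Callable[[bytes], None], max_bytes: int):
        self.flush_fn = flush        # hypothetical callback that performs one load
        self.max_bytes = max_bytes   # flush once the buffer reaches this size
        self.buffer: List[bytes] = []
        self.buffered_bytes = 0

    def add(self, row: Dict) -> None:
        line = json.dumps(row).encode("utf-8") + b"\n"
        self.buffer.append(line)
        self.buffered_bytes += len(line)
        if self.buffered_bytes >= self.max_bytes:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered as a single load; no-op when empty."""
        if not self.buffer:
            return
        self.flush_fn(b"".join(self.buffer))
        self.buffer = []
        self.buffered_bytes = 0


# Demo: collect the loads in a list instead of sending them anywhere.
loads: List[bytes] = []
batcher = ClientBatcher(loads.append, max_bytes=64)
for i in range(10):
    batcher.add({"device_id": i, "temp": 20})
batcher.flush()  # send the final partial batch

print(len(loads))  # 4 loads instead of 10 single-row loads
```

A production client would typically also flush on a timer so that rows do not wait indefinitely when traffic is light; that is omitted here to keep the sketch short.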

With Doris, the AIoT platform achieved an average of 8,000 QPS and a peak of 15,000 QPS.

The Future: Seamless Migration to Apache Doris

The company plans to expand its use of Doris:

Introducing Doris's compute-storage decoupled mode: addressing challenges such as data sharing, tiered storage, and resource isolation in business scenarios.
Accelerating Doris adoption: applying Doris in data analytics, log retrieval, and ad-hoc query scenarios to accelerate digital transformation.
Engaging with the Doris community: collaborating on the development of metadata, Catalog, CCR, and the compute-storage decoupled mode for sustainable progress.
