Apache Doris Achieves 70% Better Price-Performance on ARM-based AWS Graviton

VeloDB Engineering Team

Faster analytics at a lower cost depends on hardware efficiency. As cloud workloads increasingly move to ARM, Apache Doris has been extensively optimized for the ARM architecture.

Results:

We benchmarked Apache Doris on ARM-based AWS Graviton against x86 instances across five industry-standard OLAP tests (ClickBench, SSB 100G, SSB-Flat, TPC-H, and TPC-DS), and found Doris running on ARM-based Graviton consistently delivered 54%–70% higher price-performance.

These performance gains and cost savings underscore Doris' ability to take full advantage of modern ARM architecture. By using CPU vectorization to improve data throughput and tuning thread synchronization for ARM's multithreading model, Doris can deliver even faster analytical queries, higher concurrency, and lower costs for real-time analytics workloads on ARM.

In this article, we'll dive into the benchmark results and explore the architectural optimizations behind Doris' performance on ARM-based Graviton.

What are ARM, AWS Graviton, and Apache Doris?

ARM architecture, once best known for powering mobile devices like iPhones, is rapidly making its way into data centers. Built on a streamlined, energy-efficient design, ARM chips deliver higher performance per watt and lower total cost compared to traditional x86 processors.

Cloud providers like AWS are rolling out powerful ARM-based CPUs, such as AWS Graviton. Compared to x86, ARM-based Graviton is particularly suitable for data-intensive workloads, with better price-performance, scalability, and sustainability.

Released in 2024, the AWS Graviton4 chip delivers higher performance and energy efficiency for EC2 workloads than Graviton3 (Doris showed a 32% cost efficiency boost on Graviton3, more details here). Built on Arm Neoverse V2 cores, it provides 50% more cores and 75% more memory bandwidth than Graviton3, enabling higher-throughput data processing. Graviton4 also uses a more advanced manufacturing process and improved power management, with dynamic voltage and frequency scaling further enhancing its energy efficiency.

Apache Doris is a high-performance, real-time analytical database built on MPP (Massively Parallel Processing) architecture. Doris is designed for fast processing of large volumes of real-time data. It supports diverse enterprise analytics scenarios, including reporting and dashboards, ad-hoc queries, unified data warehousing, and lakehouse integration.

Apache Doris Performance on ARM Architecture in Five Benchmarks

To evaluate Apache Doris' performance on ARM-based AWS Graviton4, we ran compute–storage separated clusters on VeloDB Cloud deployed on AWS, comparing ARM (c8g.4xlarge) and x86 (c6i.4xlarge) instances.

Across five industry-standard benchmarks — ClickBench, SSB 100G, SSB-Flat 100G, TPC-H 100G, and TPC-DS 100G — Doris on Graviton4 consistently outperformed the x86 setup, achieving 50% higher query performance on average. Specifically, Doris achieved 55% faster results on ClickBench, 44% on SSB 100G, 44% on SSB-Flat 100G, 45% on TPC-H 100G, and 60% on TPC-DS 100G.

When factoring in Graviton's lower instance cost, the advantage is even more striking: Doris delivered up to 70% better price-performance, with every benchmark showing clear efficiency gains on ARM. Specifically, Doris delivered 65% better price-performance on ClickBench, 54% on SSB 100G, 54% on SSB-Flat 100G, 55% on TPC-H 100G, and 71% on TPC-DS 100G.

Price-Performance Ratio Calculation:

If system A delivers 50% higher performance than system B and costs 8% less, then:

  • Performance: A = 1.5 × B
  • Price: A = 0.92 × B

Taking B's price-performance ratio as 1, A's ratio becomes:

  • 1.5 / 0.92 = 1.63,

which means A achieves a 63% better price-performance than B.
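The same arithmetic can be written as a small helper. This is only an illustrative sketch: the function name `price_performance_gain` and its parameters are hypothetical, and the inputs are the example figures above (50% higher performance, 8% lower price), not measured benchmark values.

```python
def price_performance_gain(perf_ratio: float, price_ratio: float) -> float:
    """Return the relative price-performance gain of system A over system B.

    perf_ratio:  A's performance divided by B's (1.5 means 50% faster).
    price_ratio: A's price divided by B's (0.92 means 8% cheaper).
    """
    if perf_ratio <= 0 or price_ratio <= 0:
        raise ValueError("ratios must be positive")
    # B's price-performance is normalized to 1, so A's is perf / price.
    return perf_ratio / price_ratio - 1.0


# Example figures from the text: 50% higher performance at 8% lower cost.
gain = price_performance_gain(perf_ratio=1.5, price_ratio=0.92)
print(f"{gain:.0%}")  # 63%
```

Dividing by the price ratio rather than subtracting percentages is what makes the gain larger than the simple sum of the two improvements.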

20251030_Graviton_Five_Benchmark_1.png

20251030_Graviton_Five_Benchmark_2.png

1. Cluster Configuration

We set up Apache Doris clusters on both x86 and ARM machines in the VeloDB cloud service for testing, with each cluster configured as 1 FE + 3 BE.

  • x86 architecture: c6i.4xlarge instance, equipped with Intel Xeon Platinum 8375C (Ice Lake) processors;
  • ARM architecture: c8g.4xlarge instance, equipped with AWS's proprietary Graviton4 processors.

The detailed configuration is as follows:

20251030_Graviton_Cluster.png

2. Benchmark and Datasets

To thoroughly evaluate performance across different workloads, we ran the Apache Doris on ARM vs. x86 tests on five representative benchmarks: ClickBench, SSB 100G, SSB-Flat 100G, TPC-H 100G, and TPC-DS 100G.

20251030_Graviton_Benchmark_Dataset.png

In each benchmark, all SQL queries are executed sequentially. Each query is run three times consecutively: once as a cold query and twice as hot queries. The faster of the two hot runs is taken as that query's execution time. The per-query times are then aggregated to produce the final result.
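The measurement procedure described above can be sketched roughly as follows. This is not the actual benchmark harness: `run_query` is a hypothetical callable that executes one SQL statement against the cluster, all function names are illustrative, and summing the per-query times is an assumption, since the text only says the results are aggregated.

```python
import time
from typing import Callable, Dict, List


def run_three_times(run_query: Callable[[str], None], sql: str) -> List[float]:
    """Run one query three times in a row and return each latency in ms."""
    timings_ms = []
    for _ in range(3):
        start = time.perf_counter()
        run_query(sql)  # hypothetical: executes the SQL on the cluster
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


def hot_time_ms(timings_ms: List[float]) -> float:
    """Discard the first (cold) run and keep the fastest hot run."""
    if len(timings_ms) < 2:
        raise ValueError("need one cold run and at least one hot run")
    return min(timings_ms[1:])


def total_hot_time_ms(timings_by_query: Dict[str, List[float]]) -> float:
    """Aggregate per-query hot times; a plain sum is assumed here."""
    return sum(hot_time_ms(t) for t in timings_by_query.values())
```

For example, a query that takes 120 ms cold and then 40 ms and 35 ms hot would be recorded as 35 ms.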

Detailed Benchmark Procedure: https://doris.apache.org/docs/dev/benchmark/ssb

3. Performance Comparison Between ARM and x86 Across All Benchmarks

Here are the performance results of each benchmark across different test suites, with latency measured in milliseconds (ms).

ClickBench
ClickBench.png

SSB
SSB.png

SSB Flat 100G
SSB Flat 100G.png

TPC-H 100G
TPC-H 100G.png

TPC-DS 100G
TPC-DS 100G.png

How Apache Doris Is Optimized for AWS Graviton

Since Doris 2.0, Apache Doris has been continuously optimizing its performance on ARM architectures. By leveraging ARM's vectorized instruction sets, as well as fine-grained kernel scheduling and memory management tuning, Doris has achieved a significant improvement in query processing efficiency on ARM processors.

The key optimizations include:

  • Comprehensive vectorization support on ARM: During data processing, Doris utilizes the CPU's SIMD (Single Instruction Multiple Data) vectorization capabilities to improve data throughput per unit time—particularly effective in OLAP workloads. The vectorized instructions previously based on SSE/AVX on x86 have been ported to ARM's NEON/SVE instruction sets, enabling Apache Doris to deliver equally high-performance data processing capabilities on ARM platforms.
  • Optimized multithreading synchronization: Compared to x86 architectures, ARM offers a more relaxed memory ordering model, allowing threads to execute in parallel more efficiently. Apache Doris fully exploits ARM's multithreading capabilities by intelligently selecting synchronization mechanisms based on specific performance bottlenecks. This minimizes synchronization overhead, ensuring that CPU resources are focused on core data-processing tasks.
  • High-efficiency task scheduling model: Apache Doris excels at handling a large number of concurrent tasks under high load, thanks to its high-performance query execution engine. This engine fully leverages the multi-core parallelism of modern CPUs, decomposing and parallelizing queries during execution. Combined with ARM's advantages in power efficiency and cost-effectiveness, users can deploy more CPU cores per workload. Apache Doris can fully utilize these additional cores to accelerate SQL query execution further.

Conclusion

In summary, Apache Doris on AWS Graviton4 showed significant gains in performance and cost efficiency. The Doris and ARM combination leverages the energy efficiency of the ARM architecture and the analytical power of Doris, helping teams get ahead in both performance and cost optimization.

Want to learn more about Apache Doris and its real-time performance across different hardware? Join the Apache Doris community on Slack and connect with Doris experts and users. If you're looking for a fully managed, cloud-native version of Apache Doris, contact the VeloDB team.