
Apache Doris 4.1 Spill to Disk: Running Memory-Intensive Queries Without OOM

2026/5/8
Matt Yi
Apache Doris PMC Member and Tech VP @ VeloDB

Picture an analytics team running its end-of-quarter close. The ETL job joins three years of transaction history against a customer dimension and a product hierarchy, then aggregates by region for the executive deck. Three hours in, the database server runs out of memory, and the query dies with an OOM error. The team retries with a smaller date range, splits the job in half, and ships the deck the next morning instead of that evening. If the spill-to-disk feature had been enabled, that job would have finished, and there would be no morning panic.

The Apache Doris 4.1 release matures this feature for the Doris community.

TL;DR of what changed in 4.1:

  • Full operator coverage: Hash Join, Aggregation, and Sort all spill. These three operators account for almost all memory pressure in analytical queries.
  • Recursive repartitioning: When a recovered partition is still too large for memory, the engine repartitions it again. This handles heavy data skew, not just average-case workloads.
  • Proactive triggering: Doris detects memory pressure early and starts spilling before hitting a hard limit, rather than waiting for a query to fail and then trying to recover.

The rest of this post covers what Spill to Disk is, which workloads benefit most from it, how Apache Doris implements it end-to-end, the configuration and observability surface, and the full TPC-DS benchmark.

1. Memory-Intensive Operators and Why They Need Spill to Disk

In Apache Doris, three operators sit at the top of the OOM-risk list. They share one trait: each must accumulate state over its entire input before producing results, so memory usage scales with input size, cardinality, and skew, and the worst case sits far above the average. Spill to Disk targets exactly these three:

Operator | Memory Bottleneck
Hash Join | Build side requires loading the full dataset into a hash table
Aggregation | Maintains a hash table of group keys and intermediate aggregation state
Sort | Must buffer all data before sorting and outputting

These three share two characteristics that make them candidates for spilling:

  • Memory usage scales with data size: Input volume, cardinality, and data skew all affect how much memory they consume.

  • State is serializable: The state these operators maintain doesn't need to stay in memory permanently. It can be written to disk and read back on demand.

The core concept behind spill to disk is to convert revocable memory into disk state, then read it back in smaller streaming batches during the recovery phase.

2. How Apache Doris Implemented Spill to Disk

Doris does not implement spill inside each operator in isolation. A control layer and an operator layer coordinate the work, sharing a common infrastructure layer and a memory management layer beneath them.

pic1.png

The four layers:

  • Control layer: PipelineTask handles the reserve entry point and memory pressure detection. WorkloadGroupMgr manages paused queries and decides whether to spill or cancel. QueryTaskController selects specific spill targets and creates SpillContext.

  • Operator layer: The three core operators (Partitioned HashJoin, Partitioned Aggregation, and Spill Sort) each implement revoke_memory() to handle their own data persistence and recovery; a minimal interface sketch follows this list.

  • Infrastructure layer: SpillFileManager manages disk directories and garbage collection. SpillFile, SpillFileWriter, and SpillFileReader provide block-level read/write abstractions. SpillRepartitioner handles multi-level repartitioning.

  • Memory management layer: Three-level memory checks (Query Limit → Workload Group Limit → Process Limit) form a layered memory defense.
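To make the operator-layer contract concrete, here is a minimal C++ sketch of the interface such an operator could expose. The names (SpillableOperator, revocable_mem_size, revoke_memory as a virtual method, the SpillContext forward declaration) are illustrative assumptions for this post, not Doris's actual class hierarchy; only the shape of the contract matters: report how much state is revocable, and persist it on demand.

#include <cstddef>

// Illustrative sketch only -- not the actual Apache Doris class hierarchy.
struct SpillContext;  // would carry the completion callback described in the trigger flow below

class SpillableOperator {
public:
    virtual ~SpillableOperator() = default;

    // How many bytes of operator state could be written to disk right now.
    virtual size_t revocable_mem_size() const = 0;

    // Persist the revocable state to spill files and release the memory.
    // The control layer calls this only while the query is paused.
    virtual void revoke_memory(SpillContext* ctx) = 0;
};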

3. The Trigger Flow: Reserve, Pause, Spill, Resume

The entire Spill to Disk process follows a pause → persist → resume cycle.

Overview

pic2.png

Overall trigger flow: reserve, pause, spill, resume

Reserve Memory: Three-Level Validation

Before each execution round, PipelineTask estimates how much memory the operator will need for the next step, then calls try_reserve() to reserve that amount. This isn't a real allocation; it's a pre-check that decides whether execution should continue before entering the operator logic.

try_reserve() validates three constraints simultaneously:

pic3.png

Three-level reserve validation

Key design note: When enable_reserve_memory is on, QueryContext will disable the query tracker's immediate hard check. Apache Doris prefers that queries enter a "pausable and spillable" path rather than failing immediately when a limit is hit.
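As a rough illustration of the three-level check, here is a self-contained C++ sketch. The function name, parameters, and return codes are assumptions made for this post, not the real try_reserve() signature; it only shows the order in which the Query, Workload Group, and Process limits are evaluated.

#include <cstddef>

// Illustrative sketch of the three-level reserve check, not the Doris API.
enum class ReserveResult { kOk, kQueryLimitExceeded, kWorkloadGroupLimitExceeded, kProcessLimitExceeded };

ReserveResult try_reserve_sketch(size_t bytes,
                                 size_t query_used, size_t query_limit,
                                 size_t wg_used, size_t wg_limit,
                                 size_t proc_used, size_t proc_limit) {
    // Level 1: would this round push the query over its own limit?
    if (query_used + bytes > query_limit) return ReserveResult::kQueryLimitExceeded;
    // Level 2: would it push the Workload Group over its limit?
    if (wg_used + bytes > wg_limit) return ReserveResult::kWorkloadGroupLimitExceeded;
    // Level 3: would it push the whole BE process over its limit?
    if (proc_used + bytes > proc_limit) return ReserveResult::kProcessLimitExceeded;
    return ReserveResult::kOk;  // safe to execute this round without spilling
}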

Proactive Spill Trigger

Beyond the passive trigger (reserve failure), PipelineTask also checks proactively whether spill should be initiated early. Spill triggers when all three conditions are met simultaneously:

Condition | Description
reserve_size × parallelism > query_limit / 5 | The estimated reserve for this round, summed across parallel instances, exceeds 20% of the query limit
System under high memory pressure | Query memory usage ≥ 90%, or the Workload Group touches its low/high watermark
revocable_mem × parallelism ≥ query_limit × 20% | The pipeline's current revocable memory is large enough to be worth spilling

The goal of the proactive spill trigger is to avoid waiting until a memory allocation fails: when the system detects high memory pressure, it converts revocable memory to disk state ahead of time.
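The three conditions combine with a logical AND, as in the hedged C++ sketch below. The thresholds (1/5 of the limit, 90%, 20%) follow the table above; the function and parameter names are inventions for this post, not the Doris source.

#include <cstddef>

// Illustrative predicate for the proactive trigger described above.
bool should_trigger_proactive_spill(size_t reserve_size, size_t parallelism,
                                    size_t query_limit, double query_mem_usage,
                                    bool wg_touches_watermark, size_t revocable_mem) {
    bool reserve_is_large  = reserve_size * parallelism > query_limit / 5;
    bool high_mem_pressure = query_mem_usage >= 0.9 || wg_touches_watermark;
    bool enough_to_revoke  = revocable_mem * parallelism >= query_limit / 5;  // 20% of the limit
    return reserve_is_large && high_mem_pressure && enough_to_revoke;
}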

Paused Query State

When a reserve fails or proactive revoking triggers, the query enters a Paused state:

  • Records the pause reason (QUERY_MEMORY_EXCEEDED / WG_MEMORY_EXCEEDED / PROCESS_MEMORY_EXCEEDED)
  • Enables low memory mode
  • Blocks on memory_sufficient_dependency

The query stops executing and waits for WorkloadGroupMgr to handle it.

WorkloadGroupMgr Decision Logic

WorkloadGroupMgr follows different branches depending on the pause reason.

pic4.png

WorkloadGroupMgr decision flow

Two key constraints to keep in mind:

  • Only Paused Queries are eligible for spill triggering by WorkloadGroupMgr.
  • QueryTaskController requires that there are no running tasks in a fragment before starting revocation. Apache Doris doesn't allow uncoordinated memory reclamation of a query's state while it's actively executing.

QueryTaskController: Selecting Spill Targets

QueryTaskController::revoke_memory() executes the following logic:

  1. Collect all revocable tasks across all fragments
  2. Sort by revocable size in descending order
  3. Select a batch of tasks: target recovery amount = current actual memory consumption × 20% (it uses actual consumption, not the limit: if the limit is 1.6 GB but only 1 GB is in use, the target recovery amount is 200 MB, not 320 MB); the selection loop is sketched after this list
  4. Create a SpillContext for the selected tasks (with a completion callback)
  5. Call each task's revoke_memory(), conditional on revocable_mem_size >= MIN_SPILL_WRITE_BATCH_MEM (512 KB)
  6. When all tasks complete, the SpillContext callback fires → memory_sufficient = true → query resumes scheduling
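Steps 2 and 3 can be sketched as the following C++ selection loop. The Task struct, the 512 KB constant, and the function name are stand-ins taken from the description above, not the Doris code.

#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative selection loop: order candidates by revocable size, then pick
// just enough of them to free ~20% of the query's current consumption.
struct Task { size_t revocable_bytes; };

std::vector<Task*> pick_spill_targets(std::vector<Task*> candidates, size_t current_consumption) {
    constexpr size_t kMinSpillBatch = 512 * 1024;   // 512 KB floor per task
    const size_t target = current_consumption / 5;  // 20% of actual usage, not of the limit

    std::sort(candidates.begin(), candidates.end(),
              [](const Task* a, const Task* b) { return a->revocable_bytes > b->revocable_bytes; });

    std::vector<Task*> picked;
    size_t freed = 0;
    for (Task* t : candidates) {
        if (t->revocable_bytes < kMinSpillBatch) continue;  // too small to be worth a spill write
        picked.push_back(t);
        freed += t->revocable_bytes;
        if (freed >= target) break;                         // enough to relieve the pressure
    }
    return picked;
}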

4. Shared Infrastructure

Although each operator implements spill differently, they share a common infrastructure layer.

SpillFileManager

Role: Global management of spill storage directories and disk resources.

                     SpillFileManager

          ┌───────────────┼───────────────┐
          │               │               │
    ┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
    │ SpillData │  │ SpillData │  │ SpillData │
    │ Dir (SSD) │  │ Dir (SSD) │  │ Dir (HDD) │
    └───────────┘  └───────────┘  └───────────┘

Key behaviors:

  • Disk selection: Prefers SSD, then selects among available disks in ascending order of usage rate (a small selection sketch follows this list).
  • Background GC: Periodically scans and cleans expired spill directories.
  • Capacity management: Checks remaining disk space via reach_capacity_limit().
  • Read/write metrics: Maintains global counters for spill read and write byte totals.
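For illustration, the disk-selection policy could look like the sketch below; SpillDataDir and its fields are assumptions made for this post rather than the actual Doris structures.

#include <string>
#include <vector>

// Illustrative sketch: prefer SSD directories, then the lowest usage rate.
struct SpillDataDir { std::string path; bool is_ssd; double usage_rate; };

const SpillDataDir* pick_spill_dir(const std::vector<SpillDataDir>& dirs) {
    const SpillDataDir* best = nullptr;
    for (const auto& d : dirs) {
        if (best == nullptr
            || (d.is_ssd && !best->is_ssd)                                    // SSD beats HDD
            || (d.is_ssd == best->is_ssd && d.usage_rate < best->usage_rate)  // then least-used wins
        ) {
            best = &d;
        }
    }
    return best;  // nullptr means no spill directory is configured
}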

SpillFile, SpillFileWriter, and SpillFileReader

SpillFile is a logical file abstraction that can automatically split into multiple parts under the hood:

pic5.png

SpillFile internals

This design lets upper-level operators focus on "write block / read block" without managing file splitting or metadata.

Fault tolerance: SpillFile's destructor automatically calls gc() to delete the spill directory and release the disk quota. Even if the writer doesn't close normally, byte counts continue to be tracked, ensuring GC has accurate disk usage statistics.

SpillRepartitioner

SpillRepartitioner is the core building block for multi-level spill in Apache Doris, primarily used by Hash Join and Aggregation; a toy repartitioning sketch follows the key characteristics below:

pic6.png

SpillRepartitioner

Key characteristics:

  • Two modes: Join mode (uses PartitionerBase for hashing) and Aggregation mode (uses key column index for CRC32 hashing).
  • Incremental processing: Processes approximately 32 MB at a time, then yields CPU back to the scheduler to avoid long blocking.
  • Persistent Reader: Maintains read position across yields, avoiding redundant re-reads.
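As a toy illustration of one level of repartitioning, the sketch below hashes the rows of an over-sized partition into next-level sub-partitions. Row, the hash constant, and the level-salting trick are assumptions made for this sketch; Doris itself uses PartitionerBase or CRC32 hashing as noted above.

#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: split one over-sized partition into `fanout` sub-partitions.
struct Row { uint64_t key; /* payload columns ... */ };

std::vector<std::vector<Row>> repartition_once(const std::vector<Row>& input,
                                               size_t fanout, uint32_t level) {
    std::vector<std::vector<Row>> sub(fanout);
    for (const Row& r : input) {
        // Salt the hash with the level so that level N+1 does not reproduce the
        // exact split that already proved too coarse at level N.
        uint64_t h = (r.key ^ (uint64_t(level) << 32)) * 0x9E3779B97F4A7C15ULL;
        sub[h % fanout].push_back(r);
    }
    return sub;  // each sub-partition would then get its own spill file
}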

5. How Spill to Disk Works with Each Operator

Hash Join:

Hash Join is the most complex part of Apache Doris's spill implementation because it requires coordination between the build and probe sides, and recovery may require recursive repartitioning.

Normal Path vs. Spill Path

pic7.png

First Spill: Build Side Switches from Full Hash Table to Partitioned Writes

  • First spill (is_spilled == false): Extract build data from the internal hash table → partition by join key → write each partition to its own SpillFile → disable Runtime Filter (the build side is no longer fully resident in memory)
  • Subsequent spills (is_spilled == true): Continue flushing the current in-memory partition blocks to their corresponding spill files

Why the Probe Side Also Spills

Once a build-side partition has been written to disk, the probe side cannot use the current hash table for that partition, because the table no longer contains the complete build data. The probe operator partitions using the same join key, writes probe rows for spilled partitions to disk, and continues joining non-spilled partitions against the in-memory hash table.

Recovery Phase: Processing Partitions One at a Time

When input is exhausted, the probe side assembles the build and probe spill files into a unified Partition Queue (_spill_partition_queue). Each element is a JoinSpillPartitionInfo containing both a build_file and a probe_file. Each partition is processed in four steps:

  1. Recover build blocks from the build spill file into memory
  2. Build the internal hash table from those build blocks
  3. Recover probe blocks from the probe spill file
  4. Complete the join for that partition and output results

Recursive Repartitioning: Handling Data Skew

If a recovered partition's build data is still too large for available memory, SpillRepartitioner splits the current build/probe partition into next-level sub-partitions (level + 1), which are placed back into _spill_partition_queue. This repeats up to spill_repartition_max_depth levels.

This means Apache Doris' Hash Join spill is not a simple write-to-disk-and-read-back operation; it is partitioned execution with recursive partitioned recovery as needed.
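The control flow of the recovery phase, including the recursion, can be sketched as a queue-driven loop. Everything below (SpillPartition, the even split, the parameter names) is an illustrative stand-in for this post, not the Doris implementation; max_depth corresponds to spill_repartition_max_depth.

#include <cstddef>
#include <deque>

// Illustrative sketch of partitioned recovery with recursive repartitioning.
struct SpillPartition {
    int level = 0;           // how many times this data has already been repartitioned
    size_t build_bytes = 0;  // size of the build-side spill data
    // real code would also hold build_file / probe_file handles here
};

void recover_spilled_join(std::deque<SpillPartition> queue,
                          size_t memory_budget, int max_depth, int fanout) {
    while (!queue.empty()) {
        SpillPartition p = queue.front();
        queue.pop_front();

        if (p.build_bytes > memory_budget && p.level < max_depth) {
            // Still too large: split into next-level sub-partitions and retry later.
            // (An even split is assumed here; in reality the repartitioner decides the sizes.)
            for (int i = 0; i < fanout; ++i) {
                queue.push_back({p.level + 1, p.build_bytes / fanout});
            }
            continue;
        }
        // Small enough (or max depth reached): rebuild the hash table from the
        // build file, stream the probe file through it, and emit the join
        // results for this partition (steps 1-4 above).
    }
}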

Aggregation:

This section covers the Partitioned Aggregation spill. Streaming Aggregation does not support spill; related parameters only limit its memory usage.

Sink Side: Serialize Group State and Partition to Disk

pic8.png

When revoke_memory() triggers: iterate the current hash table → partition by group key hash → serialize key columns and aggregate state into blocks → write each partition to a SpillFile → reset the hash table.

Key point: Aggregation spill writes intermediate aggregation state, not final results. The same group may be spilled multiple times, and recovery requires re-merging.

Source Side: Recover by Partition, Merge, Output

  1. Read a batch of blocks from the spill file
  2. merge_with_serialized_key: merge the serialized key/state back into the in-memory hash table
  3. Repeat until the current partition is fully read
  4. Produce final results from the hash table
  5. Reset the hash table and continue to the next partition

If the hash table grows large again during recovery, SpillRepartitioner creates sub-partitions, which are placed back into _partition_queue.

Aggregation recovery doesn't pipe spill file data directly to downstream operators. It first reconstructs the hash table, then produces output normally from that table.
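As a toy model of this recover-merge-output cycle, the sketch below uses a SUM over a string key to stand in for serialized group state. The structure is the point (merge all blocks of one partition into a fresh hash table, emit, then reset); none of the names correspond to the real merge_with_serialized_key code.

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative only: one spilled partition = several blocks of (key, partial state).
struct SpilledGroup { std::string key; int64_t partial_sum; };
using SpilledPartition = std::vector<std::vector<SpilledGroup>>;

std::vector<SpilledGroup> recover_partition(const SpilledPartition& partition) {
    std::unordered_map<std::string, int64_t> table;  // rebuilt from scratch per partition
    for (const auto& block : partition) {
        for (const auto& g : block) {
            // The same group may have been spilled several times, so partial
            // states must be re-merged here rather than appended.
            table[g.key] += g.partial_sum;
        }
    }
    std::vector<SpilledGroup> output;
    output.reserve(table.size());
    for (const auto& [key, sum] : table) output.push_back({key, sum});
    return output;  // final rows for this partition; the table is then reset
}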

Order By/Sort:

Spill for Sort is different from Join or Aggregation. It follows a classic External Merge Sort model, where the core unit is a sorted run rather than a partition.

Basic Approach

pic9.png

External Merge Sort flow

Sink Side

Each revoke_memory() trigger:

  1. Mark shared state as spilled, record limit/offset
  2. Create a new SpillFile
  3. Call the internal Sorter's prepare_for_spill() to complete sort preparation
  4. Loop reading sorted blocks and write them as one sorted run to disk
  5. Reset the sorter and continue receiving new input

Source Side: Multi-Way Merge

  • Fewer runs: Use VSortedRunMerger directly for the final merge.
  • Too many runs: Select a batch of runs for an intermediate merge to generate a temporary spill file, then repeat until the run count is small enough.

The merge fan-in (the number of files merged at once) = spill_sort_merge_mem_limit_bytes ÷ spill_buffer_size_bytes, with a minimum of 8; with the default values (64 MB and 8 MB), that is 8 runs per pass. The user query's limit/offset is applied only in the final merge round.
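The final step is a standard k-way merge over sorted runs. The sketch below uses plain int vectors as runs and a min-heap over the run heads; it ignores the fan-in limit and limit/offset handling and is a generic illustration of the technique, not VSortedRunMerger.

#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Illustrative k-way merge: each run is already sorted; repeatedly pop the
// smallest head element and advance within that run.
std::vector<int> merge_sorted_runs(const std::vector<std::vector<int>>& runs) {
    using Head = std::pair<int, std::pair<size_t, size_t>>;  // (value, (run index, offset))
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;

    for (size_t r = 0; r < runs.size(); ++r) {
        if (!runs[r].empty()) heap.push({runs[r][0], {r, 0}});
    }

    std::vector<int> merged;
    while (!heap.empty()) {
        auto [value, pos] = heap.top();
        heap.pop();
        merged.push_back(value);
        auto [run, idx] = pos;
        if (idx + 1 < runs[run].size()) {
            heap.push({runs[run][idx + 1], {run, idx + 1}});
        }
    }
    return merged;
}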

Operator Comparison Summary

pic10.png

6. Key Configuration Parameters

The most important parameters in the current implementation:

Parameter | Default | Description
enable_spill | false | Enable spill to disk
enable_force_spill | false | Force the spill path regardless of reserve and WG watermark checks. Testing only.
enable_reserve_memory | true | Enable the "reserve first, then execute" mechanism
spill_min_revocable_mem | 4MB | Minimum revocable memory considered worth spilling
spill_buffer_size_bytes | 8MB | Spill block size; also affects the sort merge read buffer
spill_hash_join_partition_count | 4 | Initial partition count for joins
spill_aggregation_partition_count | 4 | Initial partition count for aggregations
spill_repartition_max_depth | 8 | Maximum recursive repartitioning depth for joins and aggregations
spill_join_build_sink_mem_limit_bytes | 64MB | Memory threshold above which the join build sink proactively flushes after entering spilled state
spill_aggregation_sink_mem_limit_bytes | 64GB | Proactive flush threshold for the aggregation sink
spill_sort_sink_mem_limit_bytes | 64MB | Proactive flush threshold for the sort sink
spill_sort_merge_mem_limit_bytes | 64MB | Total memory budget for the sort merge phase

Key Workload Group watermark parameters:

  • memory_low_watermark
  • memory_high_watermark
  • slot_memory_policy

Note: low_watermark and high_watermark don't directly mean "start spilling." They signal high memory pressure to the system, which then influences how paused queries are handled.

7. Observability

When a spill occurs, these tools are available for monitoring and debugging:

  • Query Profile: Spilled operators show "Spilled": "true" in their profile, along with timers like SpillWriteSerializeBlockTime and SpillMergeSortTime. These help identify where disk write latency is concentrated.
  • Logs: Key spill events are logged at INFO level (e.g., "all spill tasks done, resume it"). More detailed partition-level information is available by enabling VLOG_DEBUG.
  • Force spill testing (enable_force_spill): Setting this to true bypasses reserve and Workload Group watermark checks and forces all spill-capable operators through the disk path. Useful for verifying spill correctness in testing without needing to reproduce real memory pressure.

8. Benchmark: TPC-DS 10TB on a Single Node

Test Environment and Configuration

Test environment:

  • 1 FE (8C16G) + 1 BE (8C16G, 500GB SSD)
  • Labeled as 1BE 10T TPC-DS test

Session variables used:

set global enable_profile = true;
set global spill_hash_join_partition_count = 8;
set global broadcast_row_count_limit = 3000000.0;
set global query_timeout = 28800;
set global profile_level = 2;
set global enable_spill = true;
set global spill_join_build_sink_mem_limit_bytes = 1073741824;
set global spill_aggregation_partition_count = 8;
set global spill_aggregation_sink_mem_limit_bytes = 1073741824;
set global exec_mem_limit = 8589934592;
set global max_file_scanners_concurrency = 8;

Workload Group configuration:

alter workload group normal properties(
    'slot_memory_policy' = 'dynamic'
);

BE configuration:

disable_storage_page_cache = true
spill_storage_limit = 95%
spill_storage_root_path = /mnt/disk1/storage
mem_limit = 14G
max_sys_mem_available_low_water_mark_bytes = 268435456
soft_mem_limit_frac = 0.95
disable_segment_cache = true
max_external_file_meta_cache_num = 64

Schema Setup

Data was prepared using an external catalog:

CREATE CATALOG `dlf` PROPERTIES (
    "type" = "hms",
    "ipc.client.fallback-to-simple-auth-allowed" = "true",
    "hive.metastore.type" = "dlf",
    "dlf.uid" = "217316283625971977",
    "dlf.secret_key" = "xxxx",
    "dlf.region" = "cn-beijing",
    "dlf.proxy.mode" = "DLF_ONLY",
    "dlf.endpoint" = "dlf-vpc.cn-beijing.aliyuncs.com",
    "dlf.catalog.id" = "emr_dev",
    "dlf.access_key" = "xxxxx"
);

Query Results

Query | Timestamp | Time_ms | PeakMemoryBytes | ScanBytes | SpillWriteBytesToLocalStorage | SpillReadBytesFromLocalStorage | ScanBytesFromRemoteStorage
Query12026-03-11 10:09:04.6782495116.09 GB25.43 GB11.81 GB11.81 GB25.43 GB
Query22026-03-11 10:13:14.2082882812.01 GB84.17 GB0084.17 GB
Query32026-03-11 10:18:02.509309598560.97 MB194.32 GB00194.32 GB
Query42026-03-11 10:23:12.14550967018.10 GB747.52 GB263.04 GB277.70 GB747.52 GB
Query52026-03-11 11:48:08.8815603524.67 GB518.04 GB00518.04 GB
Query62026-03-11 11:57:29.2522961853.61 GB168.82 GB00168.82 GB
Query72026-03-11 12:02:25.4516239321.28 GB367.66 GB00367.66 GB
Query82026-03-11 12:12:49.4102147931.21 GB126.25 GB00126.25 GB
Query92026-03-11 12:16:24.22519046816.45 GB1094.79 GB001094.79 GB
Query102026-03-11 12:48:08.9252626625.59 GB66.55 GB472.18 MB470.99 MB66.55 GB
Query112026-03-11 12:52:31.60430307058.00 GB249.68 GB96.46 GB100.90 GB249.68 GB
Query122026-03-11 13:43:02.32789220887.83 MB62.61 GB0062.61 GB
Query132026-03-11 13:44:31.5627019691.98 GB462.63 GB00462.63 GB
Query142026-03-11 13:56:13.56023622977.34 GB777.25 GB1.36 GB1.36 GB777.25 GB
Query14_12026-03-11 14:35:35.88120298646.96 GB834.18 GB00834.18 GB
Query152026-03-11 15:09:25.7611089415.16 GB44.86 GB0044.86 GB
Query162026-03-11 15:11:14.71727096116.53 GB194.81 GB52.38 GB52.37 GB194.81 GB
Query172026-03-11 15:56:24.34514361645.90 GB357.26 GB52.10 GB52.10 GB357.26 GB
Query182026-03-11 16:20:20.5256635718.24 GB249.82 GB1.94 GB1.94 GB249.82 GB
Query192026-03-11 16:31:24.1114435925.05 GB277.74 GB00277.74 GB
Query202026-03-11 16:38:47.718175846914.79 MB121.82 GB00121.82 GB
Query212026-03-11 16:41:43.579129841.12 GB6.44 GB006.44 GB
Query222026-03-11 16:41:56.576371062.07 GB6.46 GB006.46 GB
Query232026-03-11 16:42:33.70484849338.17 GB506.22 GB434.40 GB434.39 GB506.22 GB
Query23_12026-03-11 19:03:58.65595800748.10 GB507.09 GB495.48 GB495.46 GB507.09 GB
Query242026-03-11 21:43:38.74814027355.33 GB259.23 GB35.90 GB35.90 GB259.23 GB
Query24_12026-03-11 22:07:01.50013737455.32 GB259.20 GB35.92 GB35.91 GB259.20 GB
Query252026-03-11 22:29:55.26014200385.58 GB492.23 GB36.24 GB36.24 GB492.23 GB
Query262026-03-11 22:53:35.3133064431.37 GB189.45 GB00189.45 GB
Query272026-03-11 22:58:41.7715697072.69 GB339.86 GB00339.86 GB
Query282026-03-11 23:08:11.49419781793.50 GB968.46 GB00968.46 GB
Query292026-03-11 23:41:09.68836296525.52 GB361.30 GB202.97 GB202.96 GB361.30 GB
Query302026-03-12 00:41:39.356745677.46 GB13.95 GB2.55 GB2.55 GB13.95 GB
Query312026-03-12 00:42:53.9383380115.58 GB169.21 GB00169.21 GB
Query322026-03-12 00:48:31.965192334432.80 MB121.80 GB00121.80 GB
Query332026-03-12 00:51:44.3185888393.47 GB475.30 GB00475.30 GB
Query342026-03-12 01:01:33.1744390785.05 GB74.20 GB947.62 MB947.52 MB74.20 GB
Query352026-03-12 01:08:52.4114212396.61 GB66.72 GB5.38 GB5.38 GB66.72 GB
Query362026-03-12 01:15:53.6715604062.19 GB357.64 GB00357.64 GB
Query372026-03-12 01:25:14.091125019717.86 MB68.33 GB0068.33 GB
Query382026-03-12 01:27:19.1239638708.03 GB67.00 GB25.08 GB25.08 GB67.00 GB
Query392026-03-12 01:43:23.012101183.90 GB6.17 GB006.17 GB
Query39_12026-03-12 01:43:33.15673784.14 GB6.17 GB006.17 GB
Query402026-03-12 01:43:40.5487721764.99 GB132.54 GB22.24 GB22.24 GB132.54 GB
Query412026-03-12 01:56:32.740381219.99 MB0000
Query422026-03-12 01:56:33.135355028705.32 MB242.20 GB00242.20 GB
Query432026-03-12 02:02:28.1792332001.40 GB71.82 GB0071.82 GB
Query442026-03-12 02:06:21.39316418811.27 GB0000
Query452026-03-12 02:33:43.2901025585.62 GB55.25 GB0055.25 GB
Query462026-03-12 02:35:25.8628454446.90 GB242.34 GB1.26 GB1.26 GB242.34 GB
Query472026-03-12 02:49:31.3266599001.83 GB204.06 GB00204.06 GB
Query482026-03-12 03:00:31.2414681861.22 GB249.58 GB00249.58 GB
Query492026-03-12 03:08:19.44511037474.54 GB759.59 GB00759.59 GB
Query502026-03-12 03:26:43.2105793984.50 GB233.52 GB00233.52 GB
Query512026-03-12 03:36:22.63025199607.57 GB243.48 GB29.05 GB29.05 GB243.48 GB
Query522026-03-12 04:18:22.605350151718.79 MB242.20 GB00242.20 GB
Query532026-03-12 04:24:12.772339872618.42 MB204.05 GB00204.05 GB
Query542026-03-12 04:29:52.6624256034.80 GB265.87 GB00265.87 GB
Query552026-03-12 04:36:58.280347111752.24 MB242.20 GB00242.20 GB
Query562026-03-12 04:42:45.4076262812.79 GB476.64 GB00476.64 GB
Query572026-03-12 04:53:11.7043094932.10 GB99.12 GB0099.12 GB
Query582026-03-12 04:58:21.2164750252.30 GB402.98 GB00402.98 GB
Query592026-03-12 05:06:16.2575663881.61 GB71.82 GB0071.82 GB
Query602026-03-12 05:15:42.6605906533.19 GB476.60 GB00476.60 GB
Query612026-03-12 05:25:33.3288821324.36 GB592.47 GB00592.47 GB
Query622026-03-12 05:40:15.477857281.77 GB26.01 GB0026.01 GB
Query632026-03-12 05:41:41.221336360720.48 MB204.06 GB00204.06 GB
Query642026-03-12 05:47:17.59815592958.00 GB617.30 GB2.15 GB2.15 GB617.30 GB
Query652026-03-12 06:13:17.16923159707.42 GB408.10 GB60.44 GB60.44 GB408.10 GB
Query662026-03-12 06:51:53.1622988933.04 GB221.15 GB00221.15 GB
Query672026-03-12 06:56:52.06994084396.81 GB228.59 GB373.37 GB373.36 GB228.59 GB
Query682026-03-12 09:33:40.52311775685.38 GB366.06 GB534.01 MB533.98 MB366.06 GB
Query692026-03-12 09:53:18.1061241511.13 GB925.67 MB00925.67 MB
Query702026-03-12 09:55:22.2725456811.31 GB146.79 GB00146.79 GB
Query712026-03-12 10:04:27.9686138612.00 GB453.86 GB00453.86 GB
Query722026-03-12 10:14:42.00982431557.93 GB182.69 GB131.67 GB131.65 GB182.69 GB
Query732026-03-12 12:32:05.1864355435.01 GB74.11 GB944.57 MB944.48 MB74.11 GB
Query742026-03-12 12:39:20.75217264867.75 GB173.95 GB65.54 GB70.36 GB173.95 GB
Query752026-03-12 13:08:07.26135579688.00 GB565.43 GB196.27 GB196.26 GB565.43 GB
Query762026-03-12 14:07:25.2495796022.47 GB465.38 GB00465.38 GB
Query772026-03-12 14:17:04.8744551354.22 GB450.39 GB00450.39 GB
Query782026-03-12 14:24:40.031174529317.98 GB591.57 GB1035.60 GB1035.53 GB591.57 GB
Query792026-03-12 19:15:32.9838005498.00 GB241.61 GB1.15 GB1.15 GB241.61 GB
Query802026-03-12 19:28:53.55527419507.85 GB830.76 GB121.44 GB121.44 GB830.76 GB
Query812026-03-12 20:14:35.526903196.61 GB24.75 GB2.19 GB2.19 GB24.75 GB
Query822026-03-12 20:16:05.864220234919.21 MB133.12 GB00133.12 GB
Query832026-03-12 20:19:46.119545381.61 GB34.45 GB0034.45 GB
Query842026-03-12 20:20:40.676297471.01 GB14.12 GB0014.12 GB
Query852026-03-12 20:21:10.4431950316.93 GB117.46 GB00117.46 GB
Query862026-03-12 20:24:25.4921358321.48 GB63.11 GB0063.11 GB
Query872026-03-12 20:26:41.3429038488.00 GB67.03 GB19.99 GB19.99 GB67.03 GB
Query882026-03-12 20:41:45.21213664763.91 GB274.00 GB00274.00 GB
Query892026-03-12 21:04:31.7113752841.22 GB204.05 GB00204.05 GB
Query902026-03-12 21:10:47.01565010947.87 MB31.68 GB0031.68 GB
Query912026-03-12 21:11:52.04526913817.90 MB0000
Query922026-03-12 21:12:18.97797142481.83 MB62.60 GB0062.60 GB
Query932026-03-12 21:13:56.1385593494.51 GB277.95 GB00277.95 GB
Query942026-03-12 21:23:15.50710968246.34 GB89.60 GB22.16 GB22.15 GB89.60 GB
Query952026-03-12 21:41:32.353131636926.99 GB94.48 GB81.08 GB80.97 GB94.48 GB
Query962026-03-13 01:20:56.065161809562.34 MB34.28 GB0034.28 GB
Query972026-03-13 01:23:37.89343942238.24 GB256.93 GB246.50 GB246.45 GB256.93 GB
Query982026-03-13 02:36:52.1363675331.13 GB242.21 GB00242.21 GB
Query992026-03-13 02:42:59.7521798261.65 GB42.15 GB0042.15 GB

Important limitation: The Intersect/Except operator does not yet support spill. SQL involving that logic was rewritten using equivalent Join syntax for this test.

This reflects the current coverage boundary: spill mainly covers Join, Agg, and Sort. Set operators are not yet fully supported.

System Resource Usage

Memory usage:

pic11.png

CPU usage:

pic12.png

Representative Test Cases

Three representative SQL types cover the most important spill scenarios:

  • q14 / q14_1: Centered on cross_items and avg_sales CTEs, combining multi-table joins, Union All, aggregation, Having, Rollup, Order By, and Limit in a complex execution path.
  • q38: Joins three customer sets from store_sales, catalog_sales, and web_sales, validating execution under complex join combinations.
  • q87: Uses Left Join + IS NULL to express semantics that would normally require Intersect/Except, a workaround for the current Set spill limitation.

These cases validate that Apache Doris can stably combine and execute multiple operator spill mechanisms in real TPC-DS-style SQL.

Summary

To summarize the Apache Doris Spill to Disk mechanism in one sentence: Reserve Memory detects risk early; WorkloadGroupMgr pauses the query; the affected operators convert their revocable state into SpillFiles; and during recovery, that state is read back in smaller-granularity batches.

The overall flow:

PipelineTask::execute()
  ├── get_reserve_mem_size()  ──→  estimate memory needs for this round
  ├── _should_trigger_revoking()  ──→  check for high memory pressure
  │     ├── reserve × parallelism > query_limit / 5 ?
  │     ├── query_used ≥ 90% or WG touches watermark?
  │     └── revocable × parallelism ≥ query_limit × 20%?
  ├── [high pressure] add_paused_query(QUERY_MEMORY_EXCEEDED)
  │     └── WorkloadGroupMgr::handle_paused_queries()
  │           ├── already recovered → resume
  │           ├── has revocable → QueryTaskController::revoke_memory()
  │           │     ├── sort tasks by revocable size
  │           │     ├── target recovery = consumption × 20%
  │           │     └── PipelineTask::do_revoke_memory()
  │           │           └── per-operator revoke_memory()
  │           └── no revocable → disable reserve, may cancel
  └── [normal] try_reserve() → continue execution

The three core operators differ in approach:

  • Join: Partition by join key, coordinate spill across both build and probe sides, recursively repartition on recovery as needed.
  • Aggregation: Serialize intermediate aggregation state and partition by key. During recovery, merge back into the hash table before producing output.
  • Sort: Each spill produces one sorted run. Final output uses multi-way merge.

In addition, SpillIcebergTableSinkOperatorX (be/src/exec/operator/spill_iceberg_table_sink_operator.cpp) also implements revoke_memory(), supporting spill for Iceberg table write scenarios.

Operators that currently do not support spill include Window Functions / Analytic Functions and Set operators (Union/Intersect/Except). When these operators run out of memory, queries still terminate with an error.

The most important thing to understand about Apache Doris's Spill to Disk design: it's deeply embedded in Doris' execution engine, tightly coupled with PipelineTask, MemTracker, WorkloadGroup, and Operator State, making it a core scheduling mechanism in the execution engine itself.

Join the Apache Doris community on Slack and connect with Doris experts and users. If you're looking for a fully managed Apache Doris cloud service, contact the VeloDB team.
