
Apache Doris 4.1 Spill to Disk: Running Memory-Intensive Queries Without OOM

2026/5/8
Matt Yi
Apache Doris PMC Member and Tech VP @ VeloDB

Picture an analytics team running its end-of-quarter close. The ETL job joins three years of transaction history against a customer dimension and a product hierarchy, then aggregates by region for the executive deck. Three hours in, the database server runs out of memory, and the query dies with an OOM error. The team retries with a smaller date range, splits the job in half, and ships the deck the next morning instead of that evening. If the spill-to-disk feature had been enabled, that job would have finished, and there would be no morning panic.

The Apache Doris 4.1 release matures this feature for the Doris community.

TL;DR of what changed in 4.1:

  • Full operator coverage: Hash Join, Aggregation, and Sort all spill. These three operators account for almost all memory pressure in analytical queries.
  • Recursive repartitioning: When a recovered partition is still too large for memory, the engine repartitions it again. This handles heavy data skew, not just average-case workloads.
  • Proactive triggering: Doris detects memory pressure early and starts spilling before hitting a hard limit, rather than waiting for a query to fail and then trying to recover.

The rest of this post covers what Spill to Disk is, which workloads benefit most from it, how Apache Doris implements it end-to-end, the configuration and observability surface, and the full TPC-DS benchmark.

1. Memory-Intensive Operators and Why They Need Spill to Disk

In Apache Doris, three operators sit at the top of the OOM-risk list. They share one trait: each must accumulate state over its entire input before producing results, so memory usage scales with input size, cardinality, and skew, and the worst case sits far above the average. Spill to Disk targets exactly these three:

Operator | Memory Bottleneck
Hash Join | Build side requires loading the full dataset into a hash table
Aggregation | Maintains a hash table of group keys and intermediate aggregation state
Sort | Must buffer all data before sorting and outputting

These three share two characteristics that make them candidates for spilling:

  • Memory usage scales with data size: Input volume, cardinality, and data skew all affect how much memory they consume.

  • State is serializable: The state these operators maintain doesn't need to stay in memory permanently. It can be written to disk and read back on demand.

The core concept behind spill to disk is to convert revocable memory into disk state, then read it back in smaller streaming batches during the recovery phase.

2. How Apache Doris Implemented Spill to Disk

Doris does not implement spill inside each operator in isolation. A control layer and an operator layer coordinate the work, sharing a common infrastructure layer and a memory management layer beneath them.

pic1.png

The four layers:

  • Control layer: PipelineTask handles the reserve entry point and memory pressure detection. WorkloadGroupMgr manages paused queries and decides whether to spill or cancel. QueryTaskController selects specific spill targets and creates SpillContext.

  • Operator layer: The three core operators (Partitioned HashJoin, Partitioned Aggregation, and Spill Sort) each implement revoke_memory() to handle their own data persistence and recovery; a minimal interface sketch follows this list.

  • Infrastructure layer: SpillFileManager manages disk directories and garbage collection. SpillFile, SpillFileWriter, and SpillFileReader provide block-level read/write abstractions. SpillRepartitioner handles multi-level repartitioning.

  • Memory management layer: Three-level memory checks (Query Limit → Workload Group Limit → Process Limit) form a layered memory defense.
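To make the operator-layer contract concrete, here is a minimal C++ sketch of the interface such an operator could expose. The names (SpillableOperator, revocable_mem_size, revoke_memory as a virtual method, the SpillContext forward declaration) are illustrative assumptions for this post, not Doris's actual class hierarchy; only the shape of the contract matters: report how much state is revocable, and persist it on demand.

#include <cstddef>

// Illustrative sketch only -- not the actual Apache Doris class hierarchy.
struct SpillContext;  // would carry the completion callback described in the trigger flow below

class SpillableOperator {
public:
    virtual ~SpillableOperator() = default;

    // How many bytes of operator state could be written to disk right now.
    virtual size_t revocable_mem_size() const = 0;

    // Persist the revocable state to spill files and release the memory.
    // The control layer calls this only while the query is paused.
    virtual void revoke_memory(SpillContext* ctx) = 0;
};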

3. The Trigger Flow: Reserve, Pause, Spill, Resume

The entire Spill to Disk process follows a pause → persist → resume cycle.

Overview

pic2.png

Overall trigger flow: reserve, pause, spill, resume

Reserve Memory: Three-Level Validation

Before each execution round, PipelineTask estimates how much memory the operator will need for the next step, then calls try_reserve() to reserve that amount. This isn't a real allocation; it's a pre-check that decides whether execution should continue before entering the operator logic.

try_reserve() validates three constraints simultaneously:

pic3.png

Three-level reserve validation

Key design note: When enable_reserve_memory is on, QueryContext will disable the query tracker's immediate hard check. Apache Doris prefers that queries enter a "pausable and spillable" path rather than failing immediately when a limit is hit.
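As a rough illustration of the three-level check, here is a self-contained C++ sketch. The function name, parameters, and return codes are assumptions made for this post, not the real try_reserve() signature; it only shows the order in which the Query, Workload Group, and Process limits are evaluated.

#include <cstddef>

// Illustrative sketch of the three-level reserve check, not the Doris API.
enum class ReserveResult { kOk, kQueryLimitExceeded, kWorkloadGroupLimitExceeded, kProcessLimitExceeded };

ReserveResult try_reserve_sketch(size_t bytes,
                                 size_t query_used, size_t query_limit,
                                 size_t wg_used, size_t wg_limit,
                                 size_t proc_used, size_t proc_limit) {
    // Level 1: would this round push the query over its own limit?
    if (query_used + bytes > query_limit) return ReserveResult::kQueryLimitExceeded;
    // Level 2: would it push the Workload Group over its limit?
    if (wg_used + bytes > wg_limit) return ReserveResult::kWorkloadGroupLimitExceeded;
    // Level 3: would it push the whole BE process over its limit?
    if (proc_used + bytes > proc_limit) return ReserveResult::kProcessLimitExceeded;
    return ReserveResult::kOk;  // safe to execute this round without spilling
}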

Proactive Spill Trigger

Beyond the passive trigger (reserve failure), PipelineTask also checks proactively whether spill should be initiated early. Spill triggers when all three conditions are met simultaneously:

Condition | Description
reserve_size × parallelism > query_limit / 5 | The estimated reserve for this round, summed across parallel instances, exceeds 20% of the query limit
System under high memory pressure | Query memory usage ≥ 90%, or the Workload Group touches its low/high watermark
revocable_mem × parallelism ≥ query_limit × 20% | The pipeline's current revocable memory is large enough to be worth spilling

The goal of the proactive spill trigger is to avoid waiting until a memory allocation fails: when the system detects high memory pressure, it converts revocable memory to disk state ahead of time.
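The three conditions combine with a logical AND, as in the hedged C++ sketch below. The thresholds (1/5 of the limit, 90%, 20%) follow the table above; the function and parameter names are inventions for this post, not the Doris source.

#include <cstddef>

// Illustrative predicate for the proactive trigger described above.
bool should_trigger_proactive_spill(size_t reserve_size, size_t parallelism,
                                    size_t query_limit, double query_mem_usage,
                                    bool wg_touches_watermark, size_t revocable_mem) {
    bool reserve_is_large  = reserve_size * parallelism > query_limit / 5;
    bool high_mem_pressure = query_mem_usage >= 0.9 || wg_touches_watermark;
    bool enough_to_revoke  = revocable_mem * parallelism >= query_limit / 5;  // 20% of the limit
    return reserve_is_large && high_mem_pressure && enough_to_revoke;
}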

Paused Query State

When a reserve fails or proactive revoking triggers, the query enters a Paused state:

  • Records the pause reason (QUERY_MEMORY_EXCEEDED / WG_MEMORY_EXCEEDED / PROCESS_MEMORY_EXCEEDED)
  • Enables low memory mode
  • Blocks on memory_sufficient_dependency

The query stops executing and waits for WorkloadGroupMgr to handle it.

WorkloadGroupMgr Decision Logic

WorkloadGroupMgr follows different branches depending on the pause reason.

pic4.png

WorkloadGroupMgr decision flow

Two key constraints to keep in mind:

  • Only Paused Queries are eligible for spill triggering by WorkloadGroupMgr.
  • QueryTaskController requires that there are no running tasks in a fragment before starting revocation. Apache Doris doesn't allow uncoordinated memory reclamation of a query's state while it's actively executing.

QueryTaskController: Selecting Spill Targets

QueryTaskController::revoke_memory() executes the following logic:

  1. Collect all revocable tasks across all fragments
  2. Sort by revocable size in descending order
  3. Select a batch of tasks: target recovery amount = current actual memory consumption × 20% (it uses actual consumption, not the limit: if the limit is 1.6 GB but only 1 GB is in use, the target recovery amount is 200 MB, not 320 MB); the selection loop is sketched after this list
  4. Create a SpillContext for the selected tasks (with a completion callback)
  5. Call each task's revoke_memory(), conditional on revocable_mem_size >= MIN_SPILL_WRITE_BATCH_MEM (512 KB)
  6. When all tasks complete, the SpillContext callback fires → memory_sufficient = true → query resumes scheduling
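Steps 2 and 3 can be sketched as the following C++ selection loop. The Task struct, the 512 KB constant, and the function name are stand-ins taken from the description above, not the Doris code.

#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative selection loop: order candidates by revocable size, then pick
// just enough of them to free ~20% of the query's current consumption.
struct Task { size_t revocable_bytes; };

std::vector<Task*> pick_spill_targets(std::vector<Task*> candidates, size_t current_consumption) {
    constexpr size_t kMinSpillBatch = 512 * 1024;   // 512 KB floor per task
    const size_t target = current_consumption / 5;  // 20% of actual usage, not of the limit

    std::sort(candidates.begin(), candidates.end(),
              [](const Task* a, const Task* b) { return a->revocable_bytes > b->revocable_bytes; });

    std::vector<Task*> picked;
    size_t freed = 0;
    for (Task* t : candidates) {
        if (t->revocable_bytes < kMinSpillBatch) continue;  // too small to be worth a spill write
        picked.push_back(t);
        freed += t->revocable_bytes;
        if (freed >= target) break;                         // enough to relieve the pressure
    }
    return picked;
}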

4. Shared Infrastructure

Although each operator implements spill differently, they share a common infrastructure layer.

SpillFileManager

Role: Global management of spill storage directories and disk resources.

                     SpillFileManager

          ┌───────────────┼───────────────┐
          │               │               │
    ┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
    │ SpillData │  │ SpillData │  │ SpillData │
    │ Dir (SSD) │  │ Dir (SSD) │  │ Dir (HDD) │
    └───────────┘  └───────────┘  └───────────┘

Key behaviors:

  • Disk selection: Prefers SSD, then selects among available disks in ascending order of usage rate (a small selection sketch follows this list).
  • Background GC: Periodically scans and cleans expired spill directories.
  • Capacity management: Checks remaining disk space via reach_capacity_limit().
  • Read/write metrics: Maintains global counters for spill read and write byte totals.
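For illustration, the disk-selection policy could look like the sketch below; SpillDataDir and its fields are assumptions made for this post rather than the actual Doris structures.

#include <string>
#include <vector>

// Illustrative sketch: prefer SSD directories, then the lowest usage rate.
struct SpillDataDir { std::string path; bool is_ssd; double usage_rate; };

const SpillDataDir* pick_spill_dir(const std::vector<SpillDataDir>& dirs) {
    const SpillDataDir* best = nullptr;
    for (const auto& d : dirs) {
        if (best == nullptr
            || (d.is_ssd && !best->is_ssd)                                    // SSD beats HDD
            || (d.is_ssd == best->is_ssd && d.usage_rate < best->usage_rate)  // then least-used wins
        ) {
            best = &d;
        }
    }
    return best;  // nullptr means no spill directory is configured
}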

SpillFile, SpillFileWriter, and SpillFileReader

SpillFile is a logical file abstraction that can automatically split into multiple parts under the hood:

pic5.png

SpillFile internals

This design lets upper-level operators focus on "write block / read block" without managing file splitting or metadata.

Fault tolerance: SpillFile's destructor automatically calls gc() to delete the spill directory and release the disk quota. Even if the writer doesn't close normally, byte counts continue to be tracked, ensuring GC has accurate disk usage statistics.

SpillRepartitioner

SpillRepartitioner is the core building block for multi-level spill in Apache Doris, primarily used by Hash Join and Aggregation; a toy repartitioning sketch follows the key characteristics below:

pic6.png

SpillRepartitioner

Key characteristics:

  • Two modes: Join mode (uses PartitionerBase for hashing) and Aggregation mode (uses key column index for CRC32 hashing).
  • Incremental processing: Processes approximately 32 MB at a time, then yields CPU back to the scheduler to avoid long blocking.
  • Persistent Reader: Maintains read position across yields, avoiding redundant re-reads.
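As a toy illustration of one level of repartitioning, the sketch below hashes the rows of an over-sized partition into next-level sub-partitions. Row, the hash constant, and the level-salting trick are assumptions made for this sketch; Doris itself uses PartitionerBase or CRC32 hashing as noted above.

#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: split one over-sized partition into `fanout` sub-partitions.
struct Row { uint64_t key; /* payload columns ... */ };

std::vector<std::vector<Row>> repartition_once(const std::vector<Row>& input,
                                               size_t fanout, uint32_t level) {
    std::vector<std::vector<Row>> sub(fanout);
    for (const Row& r : input) {
        // Salt the hash with the level so that level N+1 does not reproduce the
        // exact split that already proved too coarse at level N.
        uint64_t h = (r.key ^ (uint64_t(level) << 32)) * 0x9E3779B97F4A7C15ULL;
        sub[h % fanout].push_back(r);
    }
    return sub;  // each sub-partition would then get its own spill file
}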

5. How Spill to Disk Works with Each Operator

Hash Join:

Hash Join is the most complex part of Apache Doris's spill implementation because it requires coordination between the build and probe sides, and recovery may require recursive repartitioning.

Normal Path vs. Spill Path

pic7.png

First Spill: Build Side Switches from Full Hash Table to Partitioned Writes

  • First spill (is_spilled == false): Extract build data from the internal hash table → partition by join key → write each partition to its own SpillFile → disable Runtime Filter (the build side is no longer fully resident in memory)
  • Subsequent spills (is_spilled == true): Continue flushing the current in-memory partition blocks to their corresponding spill files

Why the Probe Side Also Spills

Once a build-side partition has been written to disk, the probe side cannot use the current hash table for that partition, because the table no longer contains the complete build data. The probe operator partitions using the same join key, writes probe rows for spilled partitions to disk, and continues joining non-spilled partitions against the in-memory hash table.

Recovery Phase: Processing Partitions One at a Time

When input is exhausted, the probe side assembles the build and probe spill files into a unified Partition Queue (_spill_partition_queue). Each element is a JoinSpillPartitionInfo containing both a build_file and a probe_file. Each partition is processed in four steps:

  1. Recover build blocks from the build spill file into memory
  2. Build the internal hash table from those build blocks
  3. Recover probe blocks from the probe spill file
  4. Complete the join for that partition and output results

Recursive Repartitioning: Handling Data Skew

If a recovered partition's build data is still too large for available memory, SpillRepartitioner splits the current build/probe partition into next-level sub-partitions (level + 1), which are placed back into _spill_partition_queue. This repeats up to spill_repartition_max_depth levels.

This means Apache Doris' Hash Join spill is not a simple write-to-disk-and-read-back operation; it is partitioned execution with recursive partitioned recovery as needed.
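The control flow of the recovery phase, including the recursion, can be sketched as a queue-driven loop. Everything below (SpillPartition, the even split, the parameter names) is an illustrative stand-in for this post, not the Doris implementation; max_depth corresponds to spill_repartition_max_depth.

#include <cstddef>
#include <deque>

// Illustrative sketch of partitioned recovery with recursive repartitioning.
struct SpillPartition {
    int level = 0;           // how many times this data has already been repartitioned
    size_t build_bytes = 0;  // size of the build-side spill data
    // real code would also hold build_file / probe_file handles here
};

void recover_spilled_join(std::deque<SpillPartition> queue,
                          size_t memory_budget, int max_depth, int fanout) {
    while (!queue.empty()) {
        SpillPartition p = queue.front();
        queue.pop_front();

        if (p.build_bytes > memory_budget && p.level < max_depth) {
            // Still too large: split into next-level sub-partitions and retry later.
            // (An even split is assumed here; in reality the repartitioner decides the sizes.)
            for (int i = 0; i < fanout; ++i) {
                queue.push_back({p.level + 1, p.build_bytes / fanout});
            }
            continue;
        }
        // Small enough (or max depth reached): rebuild the hash table from the
        // build file, stream the probe file through it, and emit the join
        // results for this partition (steps 1-4 above).
    }
}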

Aggregation:

This section covers the Partitioned Aggregation spill. Streaming Aggregation does not support spill; related parameters only limit its memory usage.

Sink Side: Serialize Group State and Partition to Disk

pic8.png

When revoke_memory() triggers: iterate the current hash table → partition by group key hash → serialize key columns and aggregate state into blocks → write each partition to a SpillFile → reset the hash table.

Key point: Aggregation spill writes intermediate aggregation state, not final results. The same group may be spilled multiple times, and recovery requires re-merging.

Source Side: Recover by Partition, Merge, Output

  1. Read a batch of blocks from the spill file
  2. merge_with_serialized_key: merge the serialized key/state back into the in-memory hash table
  3. Repeat until the current partition is fully read
  4. Produce final results from the hash table
  5. Reset the hash table and continue to the next partition

If the hash table grows large again during recovery, SpillRepartitioner creates sub-partitions, which are placed back into _partition_queue.

Aggregation recovery doesn't pipe spill file data directly to downstream operators. It first reconstructs the hash table, then produces output normally from that table.
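As a toy model of this recover-merge-output cycle, the sketch below uses a SUM over a string key to stand in for serialized group state. The structure is the point (merge all blocks of one partition into a fresh hash table, emit, then reset); none of the names correspond to the real merge_with_serialized_key code.

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative only: one spilled partition = several blocks of (key, partial state).
struct SpilledGroup { std::string key; int64_t partial_sum; };
using SpilledPartition = std::vector<std::vector<SpilledGroup>>;

std::vector<SpilledGroup> recover_partition(const SpilledPartition& partition) {
    std::unordered_map<std::string, int64_t> table;  // rebuilt from scratch per partition
    for (const auto& block : partition) {
        for (const auto& g : block) {
            // The same group may have been spilled several times, so partial
            // states must be re-merged here rather than appended.
            table[g.key] += g.partial_sum;
        }
    }
    std::vector<SpilledGroup> output;
    output.reserve(table.size());
    for (const auto& [key, sum] : table) output.push_back({key, sum});
    return output;  // final rows for this partition; the table is then reset
}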

Order By/Sort:

Spill for Sort is different from Join or Aggregation. It follows a classic External Merge Sort model, where the core unit is a sorted run rather than a partition.

Basic Approach

pic9.png

External Merge Sort flow

Sink Side

Each revoke_memory() trigger:

  1. Mark shared state as spilled, record limit/offset
  2. Create a new SpillFile
  3. Call the internal Sorter's prepare_for_spill() to complete sort preparation
  4. Loop reading sorted blocks and write them as one sorted run to disk
  5. Reset the sorter and continue receiving new input

Source Side: Multi-Way Merge

  • Fewer runs: Use VSortedRunMerger directly for the final merge.
  • Too many runs: Select a batch of runs for an intermediate merge to generate a temporary spill file, then repeat until the run count is small enough.

The merge fan-in (the number of files merged at once) = spill_sort_merge_mem_limit_bytes ÷ spill_buffer_size_bytes, with a minimum of 8; with the default values (64 MB and 8 MB), that is 8 runs per pass. The user query's limit/offset is applied only in the final merge round.
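The final step is a standard k-way merge over sorted runs. The sketch below uses plain int vectors as runs and a min-heap over the run heads; it ignores the fan-in limit and limit/offset handling and is a generic illustration of the technique, not VSortedRunMerger.

#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Illustrative k-way merge: each run is already sorted; repeatedly pop the
// smallest head element and advance within that run.
std::vector<int> merge_sorted_runs(const std::vector<std::vector<int>>& runs) {
    using Head = std::pair<int, std::pair<size_t, size_t>>;  // (value, (run index, offset))
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;

    for (size_t r = 0; r < runs.size(); ++r) {
        if (!runs[r].empty()) heap.push({runs[r][0], {r, 0}});
    }

    std::vector<int> merged;
    while (!heap.empty()) {
        auto [value, pos] = heap.top();
        heap.pop();
        merged.push_back(value);
        auto [run, idx] = pos;
        if (idx + 1 < runs[run].size()) {
            heap.push({runs[run][idx + 1], {run, idx + 1}});
        }
    }
    return merged;
}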

Operator Comparison Summary

pic10.png

6. Key Configuration Parameters

The most important parameters in the current implementation:

Parameter | Default | Description
enable_spill | false | Enable spill to disk
enable_force_spill | false | Force the spill path regardless of reserve and WG watermark checks. Testing only.
enable_reserve_memory | true | Enable the "reserve first, then execute" mechanism
spill_min_revocable_mem | 4MB | Minimum revocable memory considered worth spilling
spill_buffer_size_bytes | 8MB | Spill block size; also affects the sort merge read buffer
spill_hash_join_partition_count | 4 | Initial partition count for joins
spill_aggregation_partition_count | 4 | Initial partition count for aggregations
spill_repartition_max_depth | 8 | Maximum recursive repartitioning depth for joins and aggregations
spill_join_build_sink_mem_limit_bytes | 64MB | Memory threshold above which the join build sink proactively flushes after entering spilled state
spill_aggregation_sink_mem_limit_bytes | 64GB | Proactive flush threshold for the aggregation sink
spill_sort_sink_mem_limit_bytes | 64MB | Proactive flush threshold for the sort sink
spill_sort_merge_mem_limit_bytes | 64MB | Total memory budget for the sort merge phase

Key Workload Group watermark parameters:

  • memory_low_watermark
  • memory_high_watermark
  • slot_memory_policy

Note: low_watermark and high_watermark don't directly mean "start spilling." They signal high memory pressure to the system, which then influences how paused queries are handled.

7. Observability

When a spill occurs, these tools are available for monitoring and debugging:

  • Query Profile: Spilled operators show "Spilled": "true" in their profile, along with timers like SpillWriteSerializeBlockTime and SpillMergeSortTime. These help identify where disk write latency is concentrated.
  • Logs: Key spill events are logged at INFO level (e.g., "all spill tasks done, resume it"). More detailed partition-level information is available by enabling VLOG_DEBUG.
  • Force spill testing (enable_force_spill): Setting this to true bypasses reserve and Workload Group watermark checks and forces all spill-capable operators through the disk path. Useful for verifying spill correctness in testing without needing to reproduce real memory pressure.

8. Benchmark: TPC-DS 10TB on a Single Node

Test Environment and Configuration

Test environment:

  • 1 FE (8C16G) + 1 BE (8C16G, 500GB SSD)
  • Labeled as 1BE 10T TPC-DS test

Session variables used:

set global enable_profile = true;
set global spill_hash_join_partition_count = 8;
set global broadcast_row_count_limit = 3000000.0;
set global query_timeout = 28800;
set global profile_level = 2;
set global enable_spill = true;
set global spill_join_build_sink_mem_limit_bytes = 1073741824;
set global spill_aggregation_partition_count = 8;
set global spill_aggregation_sink_mem_limit_bytes = 1073741824;
set global exec_mem_limit = 8589934592;
set global max_file_scanners_concurrency = 8;

Workload Group configuration:

alter workload group normal properties(
    'slot_memory_policy' = 'dynamic'
);

BE configuration:

disable_storage_page_cache = true
spill_storage_limit = 95%
spill_storage_root_path = /mnt/disk1/storage
mem_limit = 14G
max_sys_mem_available_low_water_mark_bytes = 268435456
soft_mem_limit_frac = 0.95
disable_segment_cache = true
max_external_file_meta_cache_num = 64

Schema Setup

Data was prepared using an external catalog:

CREATE CATALOG `dlf` PROPERTIES (
    "type" = "hms",
    "ipc.client.fallback-to-simple-auth-allowed" = "true",
    "hive.metastore.type" = "dlf",
    "dlf.uid" = "217316283625971977",
    "dlf.secret_key" = "xxxx",
    "dlf.region" = "cn-beijing",
    "dlf.proxy.mode" = "DLF_ONLY",
    "dlf.endpoint" = "dlf-vpc.cn-beijing.aliyuncs.com",
    "dlf.catalog.id" = "emr_dev",
    "dlf.access_key" = "xxxxx"
);

Query Results

Query | Timestamp | Time_ms | PeakMemoryBytes | ScanBytes | SpillWriteBytesToLocalStorage | SpillReadBytesFromLocalStorage | ScanBytesFromRemoteStorage
Query12026-03-11 10:09:04.6782495116.09 GB25.43 GB11.81 GB11.81 GB25.43 GB
Query22026-03-11 10:13:14.2082882812.01 GB84.17 GB0084.17 GB
Query32026-03-11 10:18:02.509309598560.97 MB194.32 GB00194.32 GB
Query42026-03-11 10:23:12.14550967018.10 GB747.52 GB263.04 GB277.70 GB747.52 GB
Query52026-03-11 11:48:08.8815603524.67 GB518.04 GB00518.04 GB
Query62026-03-11 11:57:29.2522961853.61 GB168.82 GB00168.82 GB
Query72026-03-11 12:02:25.4516239321.28 GB367.66 GB00367.66 GB
Query82026-03-11 12:12:49.4102147931.21 GB126.25 GB00126.25 GB
Query92026-03-11 12:16:24.22519046816.45 GB1094.79 GB001094.79 GB
Query102026-03-11 12:48:08.9252626625.59 GB66.55 GB472.18 MB470.99 MB66.55 GB
Query112026-03-11 12:52:31.60430307058.00 GB249.68 GB96.46 GB100.90 GB249.68 GB
Query122026-03-11 13:43:02.32789220887.83 MB62.61 GB0062.61 GB
Query132026-03-11 13:44:31.5627019691.98 GB462.63 GB00462.63 GB
Query142026-03-11 13:56:13.56023622977.34 GB777.25 GB1.36 GB1.36 GB777.25 GB
Query14_12026-03-11 14:35:35.88120298646.96 GB834.18 GB00834.18 GB
Query152026-03-11 15:09:25.7611089415.16 GB44.86 GB0044.86 GB
Query162026-03-11 15:11:14.71727096116.53 GB194.81 GB52.38 GB52.37 GB194.81 GB
Query172026-03-11 15:56:24.34514361645.90 GB357.26 GB52.10 GB52.10 GB357.26 GB
Query182026-03-11 16:20:20.5256635718.24 GB249.82 GB1.94 GB1.94 GB249.82 GB
Query192026-03-11 16:31:24.1114435925.05 GB277.74 GB00277.74 GB
Query202026-03-11 16:38:47.718175846914.79 MB121.82 GB00121.82 GB
Query212026-03-11 16:41:43.579129841.12 GB6.44 GB006.44 GB
Query222026-03-11 16:41:56.576371062.07 GB6.46 GB006.46 GB
Query232026-03-11 16:42:33.70484849338.17 GB506.22 GB434.40 GB434.39 GB506.22 GB
Query23_12026-03-11 19:03:58.65595800748.10 GB507.09 GB495.48 GB495.46 GB507.09 GB
Query242026-03-11 21:43:38.74814027355.33 GB259.23 GB35.90 GB35.90 GB259.23 GB
Query24_12026-03-11 22:07:01.50013737455.32 GB259.20 GB35.92 GB35.91 GB259.20 GB
Query252026-03-11 22:29:55.26014200385.58 GB492.23 GB36.24 GB36.24 GB492.23 GB
Query262026-03-11 22:53:35.3133064431.37 GB189.45 GB00189.45 GB
Query272026-03-11 22:58:41.7715697072.69 GB339.86 GB00339.86 GB
Query282026-03-11 23:08:11.49419781793.50 GB968.46 GB00968.46 GB
Query292026-03-11 23:41:09.68836296525.52 GB361.30 GB202.97 GB202.96 GB361.30 GB
Query302026-03-12 00:41:39.356745677.46 GB13.95 GB2.55 GB2.55 GB13.95 GB
Query312026-03-12 00:42:53.9383380115.58 GB169.21 GB00169.21 GB
Query322026-03-12 00:48:31.965192334432.80 MB121.80 GB00121.80 GB
Query332026-03-12 00:51:44.3185888393.47 GB475.30 GB00475.30 GB
Query342026-03-12 01:01:33.1744390785.05 GB74.20 GB947.62 MB947.52 MB74.20 GB
Query352026-03-12 01:08:52.4114212396.61 GB66.72 GB5.38 GB5.38 GB66.72 GB
Query362026-03-12 01:15:53.6715604062.19 GB357.64 GB00357.64 GB
Query372026-03-12 01:25:14.091125019717.86 MB68.33 GB0068.33 GB
Query382026-03-12 01:27:19.1239638708.03 GB67.00 GB25.08 GB25.08 GB67.00 GB
Query392026-03-12 01:43:23.012101183.90 GB6.17 GB006.17 GB
Query39_12026-03-12 01:43:33.15673784.14 GB6.17 GB006.17 GB
Query402026-03-12 01:43:40.5487721764.99 GB132.54 GB22.24 GB22.24 GB132.54 GB
Query412026-03-12 01:56:32.740381219.99 MB0000
Query422026-03-12 01:56:33.135355028705.32 MB242.20 GB00242.20 GB
Query432026-03-12 02:02:28.1792332001.40 GB71.82 GB0071.82 GB
Query442026-03-12 02:06:21.39316418811.27 GB0000
Query452026-03-12 02:33:43.2901025585.62 GB55.25 GB0055.25 GB
Query462026-03-12 02:35:25.8628454446.90 GB242.34 GB1.26 GB1.26 GB242.34 GB
Query472026-03-12 02:49:31.3266599001.83 GB204.06 GB00204.06 GB
Query482026-03-12 03:00:31.2414681861.22 GB249.58 GB00249.58 GB
Query492026-03-12 03:08:19.44511037474.54 GB759.59 GB00759.59 GB
Query502026-03-12 03:26:43.2105793984.50 GB233.52 GB00233.52 GB
Query512026-03-12 03:36:22.63025199607.57 GB243.48 GB29.05 GB29.05 GB243.48 GB
Query522026-03-12 04:18:22.605350151718.79 MB242.20 GB00242.20 GB
Query532026-03-12 04:24:12.772339872618.42 MB204.05 GB00204.05 GB
Query542026-03-12 04:29:52.6624256034.80 GB265.87 GB00265.87 GB
Query552026-03-12 04:36:58.280347111752.24 MB242.20 GB00242.20 GB
Query562026-03-12 04:42:45.4076262812.79 GB476.64 GB00476.64 GB
Query572026-03-12 04:53:11.7043094932.10 GB99.12 GB0099.12 GB
Query582026-03-12 04:58:21.2164750252.30 GB402.98 GB00402.98 GB
Query592026-03-12 05:06:16.2575663881.61 GB71.82 GB0071.82 GB
Query602026-03-12 05:15:42.6605906533.19 GB476.60 GB00476.60 GB
Query612026-03-12 05:25:33.3288821324.36 GB592.47 GB00592.47 GB
Query622026-03-12 05:40:15.477857281.77 GB26.01 GB0026.01 GB
Query632026-03-12 05:41:41.221336360720.48 MB204.06 GB00204.06 GB
Query642026-03-12 05:47:17.59815592958.00 GB617.30 GB2.15 GB2.15 GB617.30 GB
Query652026-03-12 06:13:17.16923159707.42 GB408.10 GB60.44 GB60.44 GB408.10 GB
Query662026-03-12 06:51:53.1622988933.04 GB221.15 GB00221.15 GB
Query672026-03-12 06:56:52.06994084396.81 GB228.59 GB373.37 GB373.36 GB228.59 GB
Query682026-03-12 09:33:40.52311775685.38 GB366.06 GB534.01 MB533.98 MB366.06 GB
Query692026-03-12 09:53:18.1061241511.13 GB925.67 MB00925.67 MB
Query702026-03-12 09:55:22.2725456811.31 GB146.79 GB00146.79 GB
Query712026-03-12 10:04:27.9686138612.00 GB453.86 GB00453.86 GB
Query722026-03-12 10:14:42.00982431557.93 GB182.69 GB131.67 GB131.65 GB182.69 GB
Query732026-03-12 12:32:05.1864355435.01 GB74.11 GB944.57 MB944.48 MB74.11 GB
Query742026-03-12 12:39:20.75217264867.75 GB173.95 GB65.54 GB70.36 GB173.95 GB
Query752026-03-12 13:08:07.26135579688.00 GB565.43 GB196.27 GB196.26 GB565.43 GB
Query762026-03-12 14:07:25.2495796022.47 GB465.38 GB00465.38 GB
Query772026-03-12 14:17:04.8744551354.22 GB450.39 GB00450.39 GB
Query782026-03-12 14:24:40.031174529317.98 GB591.57 GB1035.60 GB1035.53 GB591.57 GB
Query792026-03-12 19:15:32.9838005498.00 GB241.61 GB1.15 GB1.15 GB241.61 GB
Query802026-03-12 19:28:53.55527419507.85 GB830.76 GB121.44 GB121.44 GB830.76 GB
Query812026-03-12 20:14:35.526903196.61 GB24.75 GB2.19 GB2.19 GB24.75 GB
Query822026-03-12 20:16:05.864220234919.21 MB133.12 GB00133.12 GB
Query832026-03-12 20:19:46.119545381.61 GB34.45 GB0034.45 GB
Query842026-03-12 20:20:40.676297471.01 GB14.12 GB0014.12 GB
Query852026-03-12 20:21:10.4431950316.93 GB117.46 GB00117.46 GB
Query862026-03-12 20:24:25.4921358321.48 GB63.11 GB0063.11 GB
Query872026-03-12 20:26:41.3429038488.00 GB67.03 GB19.99 GB19.99 GB67.03 GB
Query882026-03-12 20:41:45.21213664763.91 GB274.00 GB00274.00 GB
Query892026-03-12 21:04:31.7113752841.22 GB204.05 GB00204.05 GB
Query902026-03-12 21:10:47.01565010947.87 MB31.68 GB0031.68 GB
Query912026-03-12 21:11:52.04526913817.90 MB0000
Query922026-03-12 21:12:18.97797142481.83 MB62.60 GB0062.60 GB
Query932026-03-12 21:13:56.1385593494.51 GB277.95 GB00277.95 GB
Query942026-03-12 21:23:15.50710968246.34 GB89.60 GB22.16 GB22.15 GB89.60 GB
Query952026-03-12 21:41:32.353131636926.99 GB94.48 GB81.08 GB80.97 GB94.48 GB
Query962026-03-13 01:20:56.065161809562.34 MB34.28 GB0034.28 GB
Query972026-03-13 01:23:37.89343942238.24 GB256.93 GB246.50 GB246.45 GB256.93 GB
Query982026-03-13 02:36:52.1363675331.13 GB242.21 GB00242.21 GB
Query992026-03-13 02:42:59.7521798261.65 GB42.15 GB0042.15 GB

Important limitation: The Intersect/Except operator does not yet support spill. SQL involving that logic was rewritten using equivalent Join syntax for this test.

This reflects the current coverage boundary: spill mainly covers Join, Agg, and Sort. Set operators are not yet fully supported.

System Resource Usage

Memory usage:

pic11.png

CPU usage:

pic12.png

Representative Test Cases

Three representative SQL types cover the most important spill scenarios:

  • q14 / q14_1: Centered on cross_items and avg_sales CTEs, combining multi-table joins, Union All, aggregation, Having, Rollup, Order By, and Limit in a complex execution path.
  • q38: Joins three customer sets from store_sales, catalog_sales, and web_sales, validating execution under complex join combinations.
  • q87: Uses Left Join + IS NULL to express semantics that would normally require Intersect/Except, a workaround for the current Set spill limitation.

These cases validate that Apache Doris can stably combine and execute multiple operator spill mechanisms in real TPC-DS-style SQL.

Summary

To summarize the Apache Doris Spill to Disk mechanism in one sentence: Reserve Memory detects risk early; WorkloadGroupMgr pauses the query; the affected operators convert their revocable state into SpillFiles; and during recovery, that state is read back in smaller-granularity batches.

The overall flow:

PipelineTask::execute()
  ├── get_reserve_mem_size()  ──→  estimate memory needs for this round
  ├── _should_trigger_revoking()  ──→  check for high memory pressure
  │     ├── reserve × parallelism > query_limit / 5 ?
  │     ├── query_used ≥ 90% or WG touches watermark?
  │     └── revocable × parallelism ≥ query_limit × 20%?
  ├── [high pressure] add_paused_query(QUERY_MEMORY_EXCEEDED)
  │     └── WorkloadGroupMgr::handle_paused_queries()
  │           ├── already recovered → resume
  │           ├── has revocable → QueryTaskController::revoke_memory()
  │           │     ├── sort tasks by revocable size
  │           │     ├── target recovery = consumption × 20%
  │           │     └── PipelineTask::do_revoke_memory()
  │           │           └── per-operator revoke_memory()
  │           └── no revocable → disable reserve, may cancel
  └── [normal] try_reserve() → continue execution

The three core operators differ in approach:

  • Join: Partition by join key, coordinate spill across both build and probe sides, recursively repartition on recovery as needed.
  • Aggregation: Serialize intermediate aggregation state and partition by key. During recovery, merge back into the hash table before producing output.
  • Sort: Each spill produces one sorted run. Final output uses multi-way merge.

In addition, SpillIcebergTableSinkOperatorX (be/src/exec/operator/spill_iceberg_table_sink_operator.cpp) also implements revoke_memory(), supporting spill for Iceberg table write scenarios.

Operators that currently do not support spill include Window Functions / Analytic Functions and Set operators (Union/Intersect/Except). When these operators run out of memory, queries still terminate with an error.

The most important thing to understand about Apache Doris's Spill to Disk design: it's deeply embedded in Doris' execution engine, tightly coupled with PipelineTask, MemTracker, WorkloadGroup, and Operator State, making it a core scheduling mechanism in the execution engine itself.

Join the Apache Doris community on Slack and connect with Doris experts and users. If you're looking for a fully managed Apache Doris cloud service, contact the VeloDB team.
