In Part 1 of this series, we benchmarked VeloDB on real-time analytics, focusing on data freshness under heavy updates, query concurrency, and join performance.
In Part 2 of this performance series, we shift focus to lakehouse analytics. Open table formats like Apache Iceberg and Delta Lake have crossed the line from industry buzzword to production standard. The 2025 State of the Apache Iceberg Ecosystem survey found that Iceberg is now the default format for new analytical workloads in many organizations, and the broader lakehouse market is projected to reach roughly $13 billion by 2033.
The appeal is straightforward: open formats give you data flexibility, multi-engine interoperability, and freedom from vendor lock-in. However, openness comes with a performance tax. When your data sits on S3 rather than local SSDs, every query incurs the cost of that distance. At the same time, how data is ingested into Iceberg and other lakehouse tables is often overlooked.
3 Performance SLAs to Examine in a Lakehouse
Just as Part 1 defined three SLAs for real-time analytics, lakehouse workloads have their own performance needs. These are the questions every data team asks before committing to an open architecture.
1. Query Performance Over Open Table Formats on Object Storage
Can you query open table format data on S3 as fast as native database tables? These formats store data as Parquet or ORC files on remote object storage, and every query must contend with three layers of overhead.
Remote storage latency. Each I/O request to S3 or HDFS incurs a network round-trip. A single Parquet row group read that takes microseconds on a local NVMe takes milliseconds over the network. Even a modest query can trigger dozens of sequential HTTP range requests before processing any data, a pattern engineers call "request amplification." Multiply that across thousands of row groups and the gap adds up to seconds.
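To make the amplification concrete, here is a rough back-of-envelope calculation (the per-request latencies are illustrative assumptions, not measured values):

```
total scan latency ≈ sequential range requests × per-request round trip

local NVMe:  1,000 reads × ~100 µs ≈ 0.1 s
S3 (remote): 1,000 reads × ~20 ms  ≈ 20 s
```

The same logical scan, issued request-by-request, is two orders of magnitude slower over the network, which is why request consolidation and caching matter so much in the sections below.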
File parsing overhead. Parquet and ORC files require decompression, column mapping, and type conversion before data reaches the compute layer. Generic parsing libraries introduce redundant memory copies and format conversions that become bottlenecks at scale.
Metadata bloat. Each write, schema change, or compaction in Iceberg creates new metadata files. In high-ingest environments, these structures can grow to millions of entries, so the query planner spends seconds on file-listing traversal before scanning a single byte of actual data.
2. Price Performance
Raw query speed is only half the equation. The real question is: can you get that speed without blowing your compute budget?
Lakehouse queries are I/O-heavy by nature. They pull large volumes of Parquet data from object storage, decompress it, and run complex analytics on top. That I/O intensity translates directly into compute hours, and cloud data warehouses charge premium rates for those hours. A team running hundreds of TPC-DS-scale queries per day can see costs diverge by tens of thousands of dollars per year, depending on which engine they choose.
The metric that matters is not just "how fast" but "how fast per dollar." A system that completes all 99 TPC-DS queries in the same time as a competitor, but at half the hourly compute rate, delivers twice the value.
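As a worked example using the cluster pricing from Benchmark 1 below (~$10/hour of compute), cost per run is simply runtime times the hourly rate:

```
cost per run = runtime × hourly rate

Engine finishing in 144 s: (144 / 3600) h × $10/h = $0.40
Engine finishing in 552 s: (552 / 3600) h × $10/h ≈ $1.53
```

At identical hourly prices, the faster engine's price-performance advantage equals its speedup; if it were also cheaper per hour, the two factors would multiply.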
3. Write Performance
A lakehouse engine that can only read is incomplete. Data must flow into open table format tables efficiently, or you're stuck running a separate ingestion cluster (typically Spark or Flink) alongside your query engine.
Writing to open table formats is not trivial. Each write must create new data files, update metadata, generate new snapshots, and maintain ACID transactional guarantees. For formats like Iceberg, that means atomically committing manifest files, updating the snapshot log, and handling concurrent write conflicts. An engine that treats writes as an afterthought will either be slow or create a mess of small files that degrade downstream read performance.
A unified engine that handles both reads and writes well eliminates a second system, a second cost center, and the staleness that comes with batch-only ingestion.
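As a sketch of what unified ingestion looks like in practice (Apache Doris-style SQL; the catalog, database, and table names are hypothetical):

```sql
-- Write query results directly into an Iceberg table,
-- no separate Spark/Flink ingestion job required.
CREATE TABLE my_iceberg.sales_db.daily_sales AS
SELECT sale_date, SUM(amount) AS total_amount
FROM my_iceberg.sales_db.orders
GROUP BY sale_date;

-- Incremental loads use plain INSERT; each commit produces
-- a new Iceberg snapshot with ACID guarantees.
INSERT INTO my_iceberg.sales_db.daily_sales
SELECT sale_date, SUM(amount)
FROM my_iceberg.sales_db.orders
WHERE sale_date = CURRENT_DATE()
GROUP BY sale_date;
```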
Benchmarks to Understand Key SLAs
We selected Product D, a leading commercial unified data intelligence platform, for comparative testing. The benchmark was conducted using the industry-standard TPC-DS 1TB dataset in Iceberg format, covering all 99 queries.
Benchmark 1: Query Performance and Cost
Addresses: Query performance over open table formats + Price Performance
Setup:
Tests were conducted in the AWS us-west-2 (Oregon) region, with data stored on S3 in Iceberg format. Each product was configured with cache acceleration to reflect real-world production performance. Additionally, we collected comprehensive statistics to ensure the query optimizers had sufficient information to optimize SQL queries. All 99 TPC-DS queries were executed 3 times consecutively, and the fastest single-run total time and the 3-run total cost were used for comparison.
| Configuration | VeloDB | Product D |
|---|---|---|
| Compute Resources | ~96C, 768GB | ~96C, 744GB |
| Hourly Price | ~$10/hour | ~$10/hour |
| Dataset | TPC-DS 1TB, Iceberg format | TPC-DS 1TB, Iceberg format |
| Cache | Local Disk Cache | Local Disk Cache |
| Statistics | ANALYZE TABLE executed | ANALYZE TABLE executed |
Result:
| Metric | VeloDB | Product D | VeloDB Advantage |
|---|---|---|---|
| Fastest Single-Run Total Time | 144s | 552s | 3.83x |
| Single-Run Cost | $0.40 | $1.53 | 3.83x |
| 3-Run Cumulative Time | 835s | 1675s | 2.0x |
| 3-Run Total Cost | $2.32 | $4.65 | 2.0x |
The benchmark results reveal a significant performance gap. VeloDB completed all 99 queries with the fastest single-run at 144s, outperforming Product D's fastest run (552s) by 3.83x.
The advantage is even more pronounced in terms of efficiency. VeloDB's total query cost for a single run was only $0.40, versus $1.53 for Product D, a 3.8x price-performance advantage. This is driven by VeloDB's higher architectural efficiency and lower compute overhead. At enterprise scale, a nearly 4x performance lead and the corresponding cost savings translate into substantial operational advantages over time.
Benchmark 2: Iceberg Write Performance
Addresses: Write performance
Query speed dominates most lakehouse benchmarks, but a complete data platform must also handle ingestion. We benchmarked VeloDB against Product D by writing the full TPC-DS dataset into Iceberg tables.
Setup:
Both engines wrote the complete TPC-DS dataset (24 tables) into Iceberg format on S3. To maintain consistency, these write operations were performed using the same cluster compute resources as the query benchmarks.
Results:
| Metric | VeloDB | Product D |
|---|---|---|
| Total Write Time | 375s | 367s |
| Tables Where the Engine Is Faster | 19 of 24 | 5 of 24 |
Total write times are nearly identical (Product D holds a roughly 2% edge, driven by a few massive fact tables). VeloDB outperformed Product D on 19 of the 24 tables.
The architectural takeaway matters more than the margin. With comparable read and write performance in a single engine, organizations can consolidate lakehouse operations onto VeloDB: no separate Spark cluster for ingestion, no extra cost center, and a simpler data architecture. The Apache Doris community is also continuously optimizing the write pipeline, so users can expect further gains in upcoming releases.
How VeloDB Delivers Lakehouse Performance
The benchmark results above don't come from any single trick. VeloDB's lakehouse performance is the product of a layered optimization system that spans the full query lifecycle: from data access to planning to execution to caching to data management. Each layer builds on the one before it.
Native Lakehouse Access via Multi-Catalog
Before any optimization can kick in, the engine must connect to your lake data. VeloDB provides native support for open table formats (Iceberg, Hudi, Delta Lake, and others) through its Multi-Catalog mechanism. You register an external catalog (Hive Metastore, AWS Glue, or an Iceberg REST catalog such as Apache Polaris), and VeloDB makes all tables available via standard SQL. No data migration, no ETL pipelines, no format conversion. Just `SELECT * FROM iceberg_catalog.db.table`.
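As an illustration, registering an Iceberg REST catalog looks roughly like this (Apache Doris-style syntax; the endpoint, warehouse path, and names are hypothetical):

```sql
-- Register an external Iceberg catalog backed by a REST catalog service.
CREATE CATALOG my_iceberg PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",
    "uri" = "http://polaris-host:8181",
    "warehouse" = "s3://my-bucket/warehouse/"
);

-- Tables in the lake become queryable immediately via standard SQL.
SELECT * FROM my_iceberg.sales_db.orders LIMIT 10;
```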
This is the entry point. Everything that follows (the optimizer, the native reader, the cache) operates on top of this catalog layer.
1. The General Query Engine
VeloDB brings years of OLAP query optimization to every lakehouse query. A key architectural decision: lake queries follow the exact same execution path as internal table queries. They don't get a simplified, second-class engine. They get the full optimizer and the full vectorized executor.
VeloDB's cost-based optimizer (CBO) applies the same global optimizations to lakehouse queries as to internal tables: join reorder, subquery unnesting, constant folding, and predicate derivation. Critically, VeloDB supports full statistics collection (ANALYZE TABLE) for open table format tables, including row counts, column-level NDV, NULL ratios, and histograms. Many competing engines fall back to heuristic estimation when querying lake tables. VeloDB does not. Accurate statistics mean the optimizer picks the right join order and access plan, which is especially important when fact tables span terabytes of data on remote storage.
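Collecting statistics on an external table is a one-liner (Apache Doris-style syntax; the catalog and table names are hypothetical):

```sql
-- Synchronously collect row counts, NDV, NULL ratios, and histograms
-- for an external Iceberg table, so the CBO plans with real numbers
-- instead of heuristic guesses.
ANALYZE TABLE my_iceberg.sales_db.store_sales WITH SYNC;
```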
Vectorized Execution Engine
The execution engine is implemented in C++ with fully vectorized, columnar processing that takes full advantage of SIMD instructions and CPU cache locality. The Pipeline execution model breaks queries into parallel execution units to maximize multi-core utilization.
Crucially, this engine does not care where the data came from. Once data enters memory, the same high-performance executor handles computation, aggregation, and joins whether the source was an internal table or an Iceberg table on S3. Performance differences between internal and lake queries stem from the I/O layer, which is precisely what lakehouse-specific optimizations address.
Runtime Filters and Dynamic Partition Pruning
┌─────────────────────────────────────────────────────────────────┐
│ Join Query Execution │
│ │
│ ┌────────────────┐ Generate Runtime Filter ┌────────────┐ │
│ │ Dimension Table │ ────────────────────────▶ │ Filter │ │
│ │ (Build Side) │ │ Predicates │ │
│ │ (Small Scan) │ │ (Min/Max, │ │
│ └────────────────┘ │ Bloom, │ │
│ │ IN List) │ │
│ └──────┬─────┘ │
│ │ │
│ Push to Scan │
│ ▼ │
│ ┌──────────────┐ │
│ │ Fact Table │ │
│ │ (Probe Side) │ │
│ │ Skip Non- │ │
│ │ matching I/O │ │
│ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
The fastest way to process data is to not process it at all. In lakehouse scenarios, fact tables reside in remote storage, where every eliminated I/O request saves a network round trip. VeloDB's Runtime Filter mechanism builds a hash table on the smaller dimension table during a join, then generates filters that prune the larger fact table before it's scanned. In typical star schema queries, this reduces the volume of fact table scans by one to two orders of magnitude.
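A typical star-schema query where this kicks in (the schema is illustrative):

```sql
-- The engine builds a hash table on dim_store first, then pushes the
-- matching store_id values (as IN-list, Bloom filter, or min/max
-- predicates) into the fact_sales scan, so non-matching row groups
-- are never fetched from remote storage at all.
SELECT s.store_name, SUM(f.sales_amount) AS revenue
FROM fact_sales f
JOIN dim_store s ON f.store_id = s.store_id
WHERE s.state = 'CA'
GROUP BY s.store_name;
```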
Global TopN Optimization
ORDER BY ... LIMIT N queries are increasingly common in AI retrieval and semantic search. VeloDB implements global late materialization: in multi-table joins, TopN filtering propagates globally, converging the candidate set on sort columns first, then reading remaining columns on demand. This cuts both compute and I/O.
┌──────────────────────────────────────────────────────────────┐
│ TopN Global Late Materialization │
│ │
│ Step 1: Read only the sort column │
│ ┌────────────┐ │
│ │ Sort Column │ ──▶ Global Top K Sort ──▶ Candidate Row IDs │
│ └────────────┘ │
│ │
│ Step 2: Backfill remaining columns for candidates only │
│ ┌────────────┐ │
│ │ Other Cols │ ◀── Read only Top K rows' data │
│ └────────────┘ │
│ │
│ ✅ Non-sort column reads: full table → Top K rows only │
│ Remote I/O dramatically reduced │
└──────────────────────────────────────────────────────────────┘
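An illustrative query shape that benefits from this optimization (table and column names are hypothetical):

```sql
-- Only the sort column (score) is read across the table to find the
-- global Top 10; the wide body column is then backfilled for just
-- those 10 rows, instead of being read for every candidate row.
SELECT doc_id, title, body, score
FROM my_iceberg.search.documents
ORDER BY score DESC
LIMIT 10;
```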
2. Lakehouse-Specific Optimizations
The general engine provides a strong baseline, but lake queries hit unique bottlenecks at each phase of execution. VeloDB addresses each one.
Query Planning: Metadata Caching
Lake table metadata is often the first bottleneck. An Iceberg table's metadata is spread across metadata files, manifest lists, and manifest files. When partitions and files reach hundreds of thousands, the query hasn't started scanning data yet, and several seconds may have passed just figuring out what to scan.
VeloDB reduces this overhead with a multi-level metadata cache: schema, partition info, and file listings are cached independently; only changed portions trigger remote fetches (incremental sync); and partition pruning and file filtering occur directly from the local cache. The result is that lake query planning speed approaches that of internal tables.
Data Reading: Native Reader
VeloDB built a custom C++ Native Reader with optimizations at every step of the read pipeline:
- Pre-read skipping: Row Group statistics (min/max, Bloom Filters) and Page-level indexes filter out irrelevant data before decompression even begins.
- Zero-copy mapping: Parquet/ORC columnar data maps directly to VeloDB's internal Column Block format, bypassing intermediate copies and type conversions.
- Merge I/O: Multiple small read requests consolidate into a single large sequential read, cutting round trips to remote storage.
- Footer Cache: Parquet file footers (schema + Row Group statistics) are cached locally, avoiding repeated tail reads on every query.
Caching: Bringing Hot Data Local
Even after Native Reader optimizations, remote storage latency is a physical constraint. Reading a data block from S3 takes milliseconds; local NVMe operates at microseconds. When a query touches thousands of blocks, that gap accumulates to seconds.
Memory (Page Cache / Footer Cache)
↓ miss
Local Disk (File Cache, SSD/NVMe)
↓ miss
Remote Storage (S3 / HDFS / OSS)
VeloDB addresses this with a multi-tier caching hierarchy:
- Hot data hits the memory cache, delivering internal-table-equivalent performance.
- Warm data hits the local disk cache (LRU-managed), with I/O speeds far exceeding remote storage.
- Cold data goes remote on first access, but still benefits from predicate pushdown and Merge I/O.
Over time, frequently queried datasets migrate into cache automatically. Repeated queries run at near-local speed.
3. Materialized Views: A Transparent Acceleration Layer
Beyond optimizing the query pipeline, VeloDB offers a more aggressive approach: materialized views built on top of lake tables.
You define a materialized view based on a lake table query (an aggregation, a multi-table join, or both). VeloDB materializes the results into its high-performance internal table format on local disk. When a subsequent query semantically matches the view, VeloDB's optimizer automatically rewrites it to use the view. No SQL changes required. The user's original query runs as written; the optimizer handles the rest. Apache Doris documentation reports that transparent rewriting has improved query speed by up to 93x in tested scenarios, though the actual speedup depends on the ratio of pre-computed work to original query complexity.
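A sketch of what this looks like (Apache Doris-style asynchronous materialized view syntax; the refresh schedule, distribution clause, and names are illustrative):

```sql
-- Materialize a lake-table aggregation into VeloDB's internal format,
-- refreshed on a schedule.
CREATE MATERIALIZED VIEW daily_revenue
BUILD IMMEDIATE
REFRESH AUTO ON SCHEDULE EVERY 1 HOUR
DISTRIBUTED BY HASH(order_date)
AS
SELECT order_date, SUM(total_price) AS revenue
FROM my_iceberg.sales_db.orders
GROUP BY order_date;

-- Later queries that semantically match this shape are rewritten to hit
-- the view automatically; the user's SQL stays unchanged.
```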
Materialized views are particularly effective for:
- High-frequency BI queries: The same aggregation running repeatedly in dashboards drops from seconds to sub-second.
- Complex join pre-computation: Expensive multi-table joins are materialized once and reused, avoiding the join cost on every query.
- AI feature pre-aggregation: Commonly used feature engineering results are cached for rapid consumption by model training and online inference.
4. Table Management: Optimizing Data at the Source
Query tuning can only go so far if the underlying data is poorly organized. Data in a lakehouse is typically ingested by external tools like Spark and Flink that are optimized for write throughput rather than read patterns. That leaves two common problems: too many small files (each requiring a separate I/O request) and unordered data (which renders min/max statistics useless for skipping).
VeloDB provides table management capabilities that address both:
- Compaction: Merges a large number of small files into fewer, larger files, reducing I/O requests and metadata pressure.
- Data Sorting: Physically sorts lake table data by specified columns. Sorted files have much higher min/max selectivity, so predicate pushdown and Row Group skipping become far more effective. In filtered queries, sorting alone can reduce scan volume by an order of magnitude.
- Data Writing: Supports INSERT INTO and CREATE TABLE AS SELECT to write directly into open table format tables with full transactional consistency.
- Snapshot Management: Cleans up expired snapshots and associated data files, keeping metadata lean and query planning fast.
- Schema Evolution: Adds columns, drops columns, and changes types, fully compatible with native lake format schema evolution.
Together, these capabilities make VeloDB a complete lakehouse platform, not just a query engine. It covers data access, writing, organization, optimization, and lifecycle management.
What These Results Mean for Your Business
For analytics teams: Faster iteration cycles on ad-hoc queries. When response times drop from double-digit seconds to single-digit seconds, analysts stay in flow rather than context-switching while they wait. Over hundreds of queries per day, that adds up.
For infrastructure and finance leaders: The TCO comparison is worth modeling against your own workloads. Based on the benchmark cost ratios above, a query workload costing about $22,000 per year on VeloDB would cost roughly two to four times as much on Product D. Your actual numbers will vary, but the direction is consistent across the tests we ran.
For architects: The benchmarks suggest VeloDB's performance profile is systemic rather than dependent on a single optimization. And with write performance on par with the leading competitor tested, VeloDB can potentially serve as a single engine for the full lakehouse lifecycle: reads, writes, compaction, and table management in one system instead of several.
Try It Yourself
Transparency matters in benchmarking. We encourage you to verify these results on your own infrastructure:
- Get started: If you don't have an instance yet, try VeloDB for free.
- Run the benchmarks: Follow our reproducible TPC-DS benchmark guide to replicate these results.
- Talk to us: If you're evaluating lakehouse engines for production, schedule a conversation with our team.

For a deeper look at the foundations behind these results, you can also read Part 1 of this series: VeloDB Performance Series Part 1: Real-Time Analytics for Data Freshness, Concurrency, and Joins.
Next in the VeloDB Performance Series: search and log analytics for observability workloads. Stay tuned.