
Data Warehousing on VeloDB

Sub-second analytics on petabyte-scale data. Open formats. Unified workloads. No vendor lock-in.

Run sub-second analytics across Doris tables and lakehouse tables like Iceberg, Hive, and Delta Lake with a single SQL statement
Handle structured tables, semi-structured JSON, and vector embeddings in one engine
Connect Tableau, Grafana, Superset, and dbt through the MySQL-compatible protocol
VeloDB data warehouse on lakehouse architecture

The analytical database your lakehouse is missing

Open formats solved the storage and interoperability problem. VeloDB solves the performance, multimodal, and operational problems that come after.

Lakehouse queries at warehouse speed
VeloDB accelerates queries on Iceberg, Hive, and Delta Lake data through async materialized views, multi-level caching, and vectorized MPP execution. Your data stays in open formats. Query performance reaches native warehouse speed.
Open by design, portable by default
VeloDB reads and writes Iceberg v3 natively with time travel and schema evolution. Delta Lake and Hive tables are queryable through multi-catalog federation. Polaris and Unity Catalog integration fits existing governance. VeloDB is Apache 2.0 licensed, and your data never enters a proprietary format.
Every data type queried in one place
VeloDB stores and queries tables, JSON, full-text, and vector embeddings together in standard SQL. VARIANT handles semi-structured data with automatic column extraction. Inverted indexes replace Elasticsearch for text search. Vector indexes enable similarity queries without a separate database.
Trusted in production

Data teams run on VeloDB

Xiaomi built a unified lakehouse on Doris and Paimon, cutting average query latency from 60 seconds to 10 seconds, a 6x improvement.

6x
Faster average query time
5x
Higher concurrency vs Presto
40s→8s
Aggregation query time

We replaced separate Presto, Druid, and Spark clusters with one Doris engine over Paimon storage. Aggregation queries dropped from 40 seconds to 8 seconds. Concurrent query capacity scaled from 5 to 80 sessions.

Data Platform Team, Xiaomi
Global consumer electronics leader
Read the full story
Xiaomi, Planet, SF Technology, Haidilao, Tencent Music, Meituan, ByteDance, Baidu, NetEase, Kwai, JD.com, Trip.com
Real-world tradeoffs

Challenges with lakehouse analytics at scale

01·Query speed
Lakehouse solved data sharing but query performance still falls short
Open lakehouse architectures centralized storage and eliminated data silos. But query performance for interactive workloads remains the unsolved problem.
Business users expect sub-second responses. Applications need low-latency concurrent access. The lakehouse can hold all the data, but serving it fast enough for these use cases still requires additional systems.
How VeloDB solves it
Sub-second analytics on lakehouse data without moving it
VeloDB accelerates Iceberg, Hive, and Delta Lake queries through multi-level caching that avoids re-reading unchanged data and async materialized views that precompute expensive aggregations. The optimizer transparently rewrites queries to use cached or materialized results without changes to application SQL.
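The idea of transparent rewriting can be shown with a toy model. The Python sketch below is only a conceptual illustration, not VeloDB's optimizer: the `ORDERS` rows, the `MaterializedView` class, and the `total_for_region` function are hypothetical names invented for this example. It shows one aggregation answered two ways, first by scanning base rows and then from a precomputed view, while the calling code stays the same.

```python
from collections import defaultdict

# Hypothetical base table: (region, amount) rows standing in for a lakehouse table.
ORDERS = [("eu", 10.0), ("us", 25.0), ("eu", 5.0), ("us", 15.0)]


def scan_sum_by_region(rows):
    """Expensive path: aggregate by scanning every base row."""
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


class MaterializedView:
    """Toy stand-in for a precomputed SUM(amount) GROUP BY region."""

    def __init__(self):
        self.totals = {}
        self.fresh = False

    def refresh(self, rows):
        # In a real system this runs asynchronously; here it is a plain call.
        self.totals = scan_sum_by_region(rows)
        self.fresh = True


def total_for_region(region, rows, mv):
    """Answer from the view when it is fresh, otherwise fall back to a scan."""
    if mv.fresh:
        return mv.totals.get(region, 0.0), "materialized view"
    return scan_sum_by_region(rows).get(region, 0.0), "base table scan"


mv = MaterializedView()
before = total_for_region("eu", ORDERS, mv)  # view not yet built
mv.refresh(ORDERS)
after = total_for_region("eu", ORDERS, mv)   # same call, served from the view
print(before)
print(after)
```

Both calls return the same total. Only the access path changes, which is the property the rewrite relies on.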
02·Multimodal
Each data type in the analytics stack requires its own system and its own pipeline
Structured data goes to the warehouse. Full-text goes to a search engine. Vector embeddings go to a dedicated vector database. JSON payloads get flattened or stored separately.
Each system has its own ingest path, query language, and consistency model. Answering questions that span multiple data types means querying multiple systems and joining results in application code.
How VeloDB solves it
Structured, semi-structured, text, and vector data in one database
VeloDB stores and queries all data types together in standard SQL. VARIANT handles semi-structured JSON with automatic column extraction. Inverted indexes with BM25 scoring handle full-text search. HNSW and IVPQ vector indexes enable similarity queries. A single SQL statement can filter structured columns, search text, and rank vectors in one round trip.
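To make the combined query shape concrete, here is a small Python sketch of the three steps in one pass: filter on a structured column, match a keyword, then rank by vector similarity. It is a brute-force illustration under stated assumptions, not how VeloDB executes such a query. The `DOCS` rows and the `hybrid_search` function are hypothetical, the keyword match is an exact token match rather than BM25 scoring, and similarity is computed by scanning every row rather than through an HNSW index.

```python
import math

# Hypothetical rows: (id, category, text, embedding).
DOCS = [
    (1, "phone", "fast charging battery", [1.0, 0.0]),
    (2, "phone", "large display", [0.0, 1.0]),
    (3, "laptop", "fast charging adapter", [1.0, 0.0]),
    (4, "phone", "fast wireless charging", [0.6, 0.8]),
]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def hybrid_search(rows, category, keyword, query_vec, k):
    """Filter by category, require the keyword, rank the rest by similarity."""
    hits = [
        (cosine(emb, query_vec), doc_id)
        for doc_id, cat, text, emb in rows
        if cat == category and keyword in text.split()
    ]
    hits.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in hits[:k]]


result = hybrid_search(DOCS, "phone", "charging", [1.0, 0.0], k=2)
print(result)
```

Row 3 is dropped by the structured filter, row 2 by the keyword, and the remaining two rows are ordered by similarity to the query vector.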
03·Operational complexity
The warehouse is one system but operating it requires six more around it
A production lakehouse stack typically includes a caching layer for query performance, an ETL scheduler for materialized view refreshes, a coordination service, and separate monitoring for each component.
Every dependency has its own deployment, upgrade cycle, and failure modes. The warehouse itself is often the simplest part to operate.
How VeloDB solves it
Built-in infrastructure that replaces external dependencies
VeloDB includes multi-level query caching, async materialized view refresh, and ZSTD compression natively. No Redis, no ETL scheduler, no ZooKeeper. The cluster has two node types, and columnar compression cuts storage by 60% or more. Fewer systems to operate means fewer things to break and fewer engineers allocated to keeping the analytics stack running.
04·Vendor lock-in
Data becomes difficult to move once it enters a proprietary warehouse
Proprietary storage formats, proprietary catalogs, and proprietary query extensions create dependencies that grow over time. The longer data stays in the system, the more pipelines, dashboards, and models depend on it.
Migration means rewriting all of them. The cost of leaving eventually exceeds the cost of staying, even when the system no longer fits.
How VeloDB solves it
Open source database with native open format support
VeloDB is built on Apache Doris, fully open source under Apache 2.0. It reads and writes Iceberg v3 natively with time travel, partition evolution, and schema evolution. Delta Lake and Hive Metastore are first-class. Polaris and Unity Catalog compatibility means VeloDB fits into your existing catalog governance without migration. Your data stays in your storage, in formats any engine can read, with no lock-in at the engine, storage, or catalog level.
Architecture overview

VeloDB for lakehouse analytics

Whether you're accelerating queries over Iceberg tables, building dimensional models with materialized views, or querying structured, semi-structured, and vector data from one SQL interface, VeloDB handles it on one engine.

VeloDB lakehouse analytics engine
Data Sources
Iceberg Tables: v1 / v2 / v3
Hive Tables: Metastore catalog
Delta Lake Tables: Streaming lakehouse
MySQL / PostgreSQL: JDBC federation
S3 / HDFS: Direct file access

Lakehouse Layer
Multi-Catalog Federation: Iceberg, Hive, Delta Lake, JDBC
Native Format Readers: Parquet, ORC
Predicate Pushdown: Skip irrelevant row groups
Data Cache: 96% hit rate in production

Internal Tables & Transforms
Unique Key + Merge-on-Write: Real-time upserts
VARIANT (JSON): Auto column extraction
Vector Indexes: HNSW / IVPQ
Inverted Indexes: BM25 full-text search
Async Materialized Views: Flexible refresh

Query Engine
MPP + Vectorized: Pipeline parallelism
Cost-Based Optimizer: Auto MV rewrite
Runtime Filters: Reduce data transfer
Join Strategies: Broadcast · Shuffle · Colocate

Serve
Dashboards: Grafana, Superset, Tableau
BI Tools: MySQL-compatible protocol
Embedded Analytics: Customer-facing apps
APIs: REST / JDBC / ODBC
dbt Integration: Transform & model

Stop renting silos.
Run analytics on your data.

Spin up a VeloDB Cloud cluster in under 60 seconds and run your first sub-second query across Iceberg, Hive, or Delta Lake.

Need help? Contact us!