VeloDB and Apache Doris Get Single-Binary CDC with Supermetal

Teams running CDC often spend more time managing the pipeline than managing the database and workloads it serves. Today, that changes for Apache Doris and VeloDB users with the introduction of our Supermetal integration.

Supermetal now replicates operational data into Apache Doris and VeloDB Cloud from all five of its supported sources: Postgres, MySQL, MongoDB, SQL Server, and Oracle. CDC updates and deletes land through Doris's native Merge-on-Write model, so queries read current state with zero read-time deduplication or filtering.

You run one Rust binary for both the initial snapshot and ongoing change capture. It loads into Doris through the S3 table-valued function (TVF) or Stream Load, and Kafka, Flink, and the JVM drop out of the architecture entirely. Initial syncs that take hours through a Debezium and Flink pipeline finish in minutes.

Single-Process CDC

A typical CDC stack for Doris chains four systems in sequence:

┌───────────┐     ┌───────────┐     ┌───────────┐     ┌─────────────┐     ┌───────────────┐
│ Source DB │ ──► │ Debezium  │ ──► │ Kafka     │ ──► │ Flink/Spark │ ──► │ Apache Doris  │
└───────────┘     └───────────┘     └───────────┘     └─────────────┘     └───────────────┘
                  (Source Conn.)  (Message Broker)  (Compute Cluster)

Every hop in that chain pays a serialize and deserialize pass. Debezium decodes the source's change log and writes row-oriented Avro or JSON to Kafka. The stream processor decodes from Kafka, transforms, then re-encodes the rows for Stream Load. Because the intermediate formats are row-oriented rather than columnar, each row gets encoded and decoded individually at every hop. At high row rates, that per-row tax dominates the cost of the pipeline.
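
To make the per-row tax concrete, here is a toy comparison of encode work under a row-oriented and a columnar layout. It is only an illustration: the schema is hypothetical, JSON stands in for any row-oriented wire format, and `struct` stands in for a columnar buffer. It does not reproduce Debezium's, Kafka's, or Arrow's actual encoders.

```python
import json
import struct

# Toy data: 1,000 rows with three numeric fields (hypothetical schema).
rows = [{"id": i, "qty": i % 50, "price": i * 0.5} for i in range(1000)]


def encode_row_oriented(rows):
    """One encode call per row, as a row-oriented format requires."""
    return [json.dumps(r).encode("utf-8") for r in rows]


def decode_row_oriented(messages):
    """One decode call per row on the receiving side of each hop."""
    return [json.loads(m) for m in messages]


def encode_columnar(rows):
    """One pack call per column, regardless of how many rows there are."""
    n = len(rows)
    return {
        "id": struct.pack(f"<{n}q", *(r["id"] for r in rows)),
        "qty": struct.pack(f"<{n}i", *(r["qty"] for r in rows)),
        "price": struct.pack(f"<{n}d", *(r["price"] for r in rows)),
    }


row_msgs = encode_row_oriented(rows)
col_bufs = encode_columnar(rows)

encode_calls_row = len(row_msgs)  # grows with the row count
encode_calls_col = len(col_bufs)  # fixed by the column count
```

In the row-oriented case the number of encode and decode calls scales with rows and repeats at every hop. In the columnar case it scales with columns, which is the property the Arrow and Parquet path relies on.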

Supermetal collapses the path into one process:

┌───────────┐     ┌────────────┐     ┌──────────────┐                ┌─────────────────┐
│ Source DB │ ──► │ Supermetal │ ──► │ Object Store │ ──────────────►│  VeloDB Cloud / │
└───────────┘     └────────────┘     │  (optional)  │     S3 TVF     │  Apache Doris   │
                                     └──────────────┘ or Stream Load └─────────────────┘
                                       S3 / Azure

Supermetal reads source rows into Apache Arrow record batches and writes Parquet. With an object storage buffer (S3 or Azure Blob Storage) configured, Doris pulls those Parquet files directly through the S3 TVF. Without a buffer, Supermetal stages Parquet on local disk and sends it to Doris through Stream Load.
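
The two load paths can be sketched as follows. This is an assumption-laden illustration, not Supermetal's implementation: the function names are hypothetical, and the exact statements and headers Supermetal issues are not described here. The S3 TVF property names and the Stream Load endpoint shape follow the Apache Doris documentation as we understand it, so check them against your Doris version.

```python
def choose_load_path(object_store_configured):
    """Mirror the rule above: buffer present -> S3 TVF, otherwise Stream Load."""
    return "s3_tvf" if object_store_configured else "stream_load"


def s3_tvf_insert(table, uri, endpoint, region, access_key, secret_key):
    """Build an INSERT ... SELECT that reads staged Parquet through the S3 TVF.

    Values are interpolated without escaping, which is acceptable only for a
    sketch; real code should validate or escape them.
    """
    props = {
        "uri": uri,
        "format": "parquet",
        "s3.endpoint": endpoint,
        "s3.region": region,
        "s3.access_key": access_key,
        "s3.secret_key": secret_key,
    }
    body = ",\n".join(f'    "{k}" = "{v}"' for k, v in props.items())
    return f"INSERT INTO {table}\nSELECT * FROM S3(\n{body}\n);"


def stream_load_request(fe_host, http_port, db, table, label):
    """Build the URL and headers for a Stream Load PUT of one Parquet file."""
    url = f"http://{fe_host}:{http_port}/api/{db}/{table}/_stream_load"
    headers = {
        "label": label,          # unique per load, so a retry is not applied twice
        "format": "parquet",
        "Expect": "100-continue",
    }
    return url, headers


# Hypothetical values for illustration only.
sql = s3_tvf_insert(
    "analytics.orders",
    "s3://example-bucket/staging/orders/*.parquet",
    "s3.us-west-2.amazonaws.com",
    "us-west-2",
    "EXAMPLE_ACCESS_KEY",
    "EXAMPLE_SECRET_KEY",
)
url, headers = stream_load_request("doris-fe.example.internal", 8030, "analytics", "orders", "orders-batch-0001")
```

With the object store path, Doris pulls the files itself, so the loader only has to issue SQL. With Stream Load, the loader pushes the file body over HTTP.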

Updates and Deletes

For tables with a primary key, Supermetal creates a Doris Unique Key table with Merge-on-Write and sets _sm_version as the sequence column. Because _sm_version derives from the source's transaction-log position, updates merge in the correct order even when batches retry. Deletes set Doris's hidden __DORIS_DELETE_SIGN__ column. This is the ingestion pattern the Unique Key model was designed for: the merge cost is paid at write time rather than at query time, so reads return current state directly and stay fast under update-heavy workloads.
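
The effect of the sequence column can be modeled in a few lines. This is a simplified in-memory model of the semantics described above, not Doris's storage engine: the `id`, `value`, and `deleted` fields and the helper names are hypothetical, and `deleted` stands in for the hidden delete-sign column.

```python
def apply_changes(table, batch):
    """Upsert each change by primary key.

    A change with a lower _sm_version never overwrites a row that already
    carries a higher one, so replaying an old batch is harmless.
    """
    for change in batch:
        current = table.get(change["id"])
        if current is not None and current["_sm_version"] > change["_sm_version"]:
            continue  # stale replay: keep the newer row
        table[change["id"]] = change
    return table


def visible_rows(table):
    """What a query sees: the latest row per key, minus delete-marked rows."""
    return {pk: row["value"] for pk, row in table.items() if not row["deleted"]}


batch_1 = [
    {"id": 1, "value": "a", "_sm_version": 10, "deleted": False},
    {"id": 2, "value": "b", "_sm_version": 11, "deleted": False},
]
batch_2 = [
    {"id": 1, "value": "a2", "_sm_version": 12, "deleted": False},  # update
    {"id": 2, "value": "b", "_sm_version": 13, "deleted": True},    # delete
]

table = {}
apply_changes(table, batch_1)
apply_changes(table, batch_2)
after_first_pass = visible_rows(table)

apply_changes(table, batch_1)  # simulate a retried, out-of-order batch
after_retry = visible_rows(table)
```

Both `after_first_pass` and `after_retry` contain only key 1 with value `a2`: the retried batch carries lower versions, so it cannot resurrect the deleted row or roll back the update.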

For tables without a primary key, Supermetal creates a Duplicate Key table. Every CDC operation appends a row, with _sm_version and _sm_deleted exposed as regular columns for read-side filtering.

Performance

The Supermetal team benchmarked snapshot and CDC from Postgres into a VeloDB Cloud trial warehouse on the TPC-H dataset. The numbers below come from their published runs.

Benchmark setup

Source               Supermetal          Target
Postgres on AWS RDS  AWS EC2             VeloDB Cloud
db.m5.2xlarge        m8azn.xlarge        VeloDB cloud-26.03
8 vCPU / 32 GB RAM   4 vCPU / 16 GB RAM  8 vCPU / 64 GB RAM
400 GiB gp3          Amazon Linux 2023   400 GB cache
us-west-2            us-west-2           us-west-2

Snapshot: 433M rows in 6 minutes 11 seconds

Throughput: Supermetal sustained roughly 1.5M rows per second (about 290 MB/sec) out of Postgres, consistent across all scale factors.
Parallelism: Supermetal chunks large tables and reads them concurrently. Smaller tables complete in seconds, while lineitem accounts for about 70% of the dataset and gates total wall-clock time.
Pipelining: Parquet stages to object storage while source reads are still running. Doris pulls the staged files in parallel through the S3 TVF, so ingestion overlaps the source read.
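
The chunking and pipelining pattern above can be sketched generically. This is not Supermetal's code, and its actual chunking strategy is not described here: the sketch assumes a table with a dense integer key, and `read_chunk` is a hypothetical stand-in for a ranged source query.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def split_range(lo, hi, n_chunks):
    """Split the half-open key range [lo, hi) into contiguous chunks."""
    step, extra = divmod(hi - lo, n_chunks)
    bounds, start = [], lo
    for i in range(n_chunks):
        end = start + step + (1 if i < extra else 0)
        if end > start:
            bounds.append((start, end))
        start = end
    return bounds


def read_chunk(bounds):
    """Stand-in for: SELECT ... WHERE key >= lo AND key < hi."""
    lo, hi = bounds
    return list(range(lo, hi))


def snapshot(lo, hi, n_chunks, workers=4):
    """Read chunks concurrently and stage each one as soon as it finishes."""
    staged = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(read_chunk, b) for b in split_range(lo, hi, n_chunks)]
        for fut in as_completed(futures):
            # Staging happens here while other chunk reads are still in flight.
            staged.append(fut.result())
    return staged


chunks = split_range(0, 10, 3)
staged = snapshot(0, 1000, n_chunks=7)
```

Because chunks are staged in completion order rather than key order, the target has to tolerate out-of-order arrival, which is one reason a keyed, versioned merge on the Doris side matters.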

CDC latency under load

Throughput: Supermetal tracked the target rate at 100% from 1K through 30K ops/sec and hit 96% at the 40K tier. Postgres logical decoding bursts above the target during commit clusters, peaking around 67K rows/sec. Above 40K, single-threaded logical decoding saturates at about 40K rows/sec on this RDS configuration.
Latency (p100): End-to-end latency holds at 7 to 9 seconds from 1K through 25K ops/sec, bounded by Supermetal's flush interval (default 5 seconds). On the write side, Doris held p100 write latency under 2 seconds at every load tier.
Source-side read: Read latency p100 stays under 500 ms up to 15K ops/sec. Above 20K, the connector queue builds and read p100 climbs to about 5 seconds at 50K, a clean source-saturation signal.
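
To see why a flush interval puts a floor under end-to-end latency, consider a minimal interval-based batcher. This is a generic sketch with a hypothetical class and an injected clock, not Supermetal's buffering logic. The 5-second default mirrors the flush interval quoted above.

```python
class IntervalBatcher:
    """Buffer changes and flush them once per interval."""

    def __init__(self, flush_interval_s=5.0):
        self.flush_interval_s = flush_interval_s
        self.buffer = []       # (arrival_time, change) pairs
        self.last_flush = 0.0
        self.flushed = []      # (flush_time, changes, longest_wait) records

    def add(self, change, now):
        self.buffer.append((now, change))
        self.tick(now)

    def tick(self, now):
        """Flush if there is buffered data and the interval has elapsed."""
        if self.buffer and now - self.last_flush >= self.flush_interval_s:
            longest_wait = max(now - arrived for arrived, _ in self.buffer)
            self.flushed.append((now, [c for _, c in self.buffer], longest_wait))
            self.buffer = []
            self.last_flush = now


batcher = IntervalBatcher(flush_interval_s=5.0)
batcher.add("a", now=1.0)
batcher.add("b", now=4.0)
batcher.tick(5.0)            # interval elapsed: both changes flush together
batcher.add("c", now=6.0)
batcher.tick(10.0)
```

A change that arrives just after a flush waits close to the full interval before it is even written, so the worst-case latency is that wait plus the write time on the target. A shorter interval lowers the floor at the cost of smaller, more frequent loads.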

Full performance details are available in the Supermetal blog post.

Try Supermetal with VeloDB Today

The Doris target is available in Supermetal now. Install the binary, point it at your source database, and set VeloDB Cloud or your Doris cluster as the target. Your first sync can be running in minutes.

Unix:

Windows:

You can also grab a direct download, and Supermetal's Doris target documentation covers configuration in detail.

New to VeloDB? Start a free trial of VeloDB Cloud and put real-time analytics on top of the operational data you just replicated.