What is Delta Lake
Delta Lake is an open-source storage format that combines Apache Parquet files with a powerful metadata transaction log. It brings ACID transactions, consistency guarantees, and data versioning capabilities to data lakes. As large-scale data lakes have become widespread in enterprises, Parquet alone cannot address performance bottlenecks, data consistency problems, or weak governance. Delta Lake emerged to address these challenges and has become an essential foundation for building modern lakehouse architectures.
What makes it worth attention is that it preserves the scalability and cost-effectiveness of data lakes while providing reliability and performance guarantees similar to those of data warehouses, making it suitable for BI, data engineering, and AI/ML workloads.
1. Why Do We Need Delta Lake?
Existing Challenges:
- In traditional data lakes, the lack of transactional support means concurrent writes can easily produce dirty data or partially committed results.
- No version management or rollback capabilities, making it difficult to audit historical data states.
- Schema consistency cannot be guaranteed, and writes can easily break the schema.
- Batch-stream separation increases system complexity and data latency.
How Delta Lake Addresses These Pain Points:
- Delta Lake provides ACID transaction capabilities, making concurrent writes safe and reliable.
- Supports historical version queries (Time Travel) for rollback, auditing, and experiment reproduction.
- Enforces schema validation and evolution mechanisms to ensure data quality.
- The same table supports both batch processing and streaming writes, simplifying ETL and architecture management.
2. Delta Lake Architecture and Core Components
The overall architecture relies on cloud object storage (such as S3, ADLS, HDFS) and the Parquet file storage layer, with Delta Log, Checkpoint, optimization modules, and other mechanisms introduced at the upper layer.
Core Components
Data Storage Layer (Parquet Files)
- Definition: Stores actual columnar data in Parquet file format.
- Function: Provides efficient compression and predicate pushdown for high query performance.
- Relationship: The Delta Log tracks which Parquet data files belong to each table version as files are added, removed, or rewritten.
Delta Log (Transaction Log)
- Definition: Metadata logs that record each transaction change in JSON and Parquet checkpoint file formats.
- Function: Implements ACID transactions, Time Travel, version control, and audit history.
- Relationship: All write operations (INSERT, UPDATE, DELETE, MERGE) generate log entries; see the log-inspection sketch after this component list.
Checkpoint
- Definition: Current state metadata snapshots generated after a certain number of transactions, stored as Parquet files.
- Function: Accelerates log reading, avoids processing all logs from the beginning, and performs metadata pruning for large tables.
- Relationship: Works with Delta Log to achieve efficient version reading.
Schema Enforcement and Evolution
- Definition: Schema metadata stored in Delta Log.
- Function: Validates field types during writes to reject non-conforming data; supports controlled schema evolution.
- Relationship: Schema updates are recorded in the log, and both reads and writes follow the latest or specified schema.
Optimization Modules (Compaction, Z-Ordering, Data Skipping)
- Definition: Data layout and file fragment optimization techniques.
- Function: Merges small files, improves data locality, and reduces scan volume, thereby improving query efficiency.
- Relationship: Works with the execution engine; file-level statistics and layout information recorded in the log enable data skipping at query time.
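To make the Delta Log and Checkpoint mechanics concrete, here is a minimal inspection sketch. It assumes a Delta table already exists at the hypothetical path /tmp/delta/events (the practical example in Section 5 creates one there); the session configuration mirrors that example.
import org.apache.spark.sql.SparkSession
// Build a Delta-enabled session (same configuration as the practical example in Section 5).
val spark = SparkSession.builder()
  .appName("DeltaLogInspection")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .master("local[*]")
  .getOrCreate()
// Each committed transaction is a numbered JSON file under _delta_log/;
// periodic Parquet checkpoints snapshot the accumulated table state.
val tablePath = "/tmp/delta/events"
val commits = spark.read.json(s"$tablePath/_delta_log/*.json")
// Every JSON line holds one action: commitInfo, metaData (schema), protocol,
// add (data file added), or remove (data file logically deleted).
commits.printSchema()
commits.show(truncate = false)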
3. Key Features and Capabilities
ACID Transactions
Ensures that insert, update, delete, and merge operations have atomicity, consistency, isolation, and durability. For example, when executing writes through Spark, only fully successful transactions are committed.
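As a minimal sketch (the path and values are illustrative, and a Delta-enabled SparkSession named `spark`, configured as in the practical example below, is assumed), each DataFrame append is committed as one atomic transaction:
import spark.implicits._
// Hypothetical table location for this sketch.
val acidPath = "/tmp/delta/acid_demo"
// Each append commits atomically: concurrent readers see either the whole batch or none of it.
Seq((1, "alpha"), (2, "beta")).toDF("id", "name")
  .write.format("delta").mode("append").save(acidPath)
Seq((3, "gamma")).toDF("id", "name")
  .write.format("delta").mode("append").save(acidPath)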
Schema Enforcement & Evolution
Automatically validates the schema during writes, and compatible schema changes (such as adding a new column, or widening Int to BigInt where type widening is supported) can evolve smoothly without service interruption.
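A small sketch of both behaviors, assuming the same Delta-enabled `spark` session; the path and the added event_date column are hypothetical:
import spark.implicits._
val schemaPath = "/tmp/delta/schema_demo"
// Initial schema: (id INT, name STRING).
Seq((1, "alpha")).toDF("id", "name")
  .write.format("delta").mode("overwrite").save(schemaPath)
// Enforcement: appending a frame with an unexpected column fails by default.
// Evolution: opting in with mergeSchema adds the new column instead of failing.
Seq((2, "beta", "2025-07-01")).toDF("id", "name", "event_date")
  .write.format("delta")
  .option("mergeSchema", "true")
  .mode("append")
  .save(schemaPath)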
Time Travel
Supports querying historical data states by specifying version numbers or timestamps, facilitating auditing or rolling back erroneous operations.
Example SQL:
SELECT * FROM table VERSION AS OF 15;
SELECT * FROM table TIMESTAMP AS OF '2025-07-01';
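The same time-travel reads are available through the DataFrame API; the sketch below assumes a Delta-enabled `spark` session and an illustrative table path:
// Read a specific version of the table.
val byVersion = spark.read.format("delta")
  .option("versionAsOf", 15)
  .load("/tmp/delta/events")
// Read the table as of a timestamp.
val byTimestamp = spark.read.format("delta")
  .option("timestampAsOf", "2025-07-01")
  .load("/tmp/delta/events")
byVersion.show()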
DML Operations (Update / Delete / Merge)
Supports complex operations, such as MERGE for CDC and SCD type scenarios:
MERGE INTO target AS t
USING updates AS u
ON t.id = u.id
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...
Suitable for data consistency synchronization and slowly changing dimension processing.
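For reference, here is one complete upsert written with the Scala DeltaTable API; the table path, the updates DataFrame, and the columns are illustrative assumptions for this sketch, not a prescribed pattern:
import io.delta.tables.DeltaTable
import spark.implicits._
val target = DeltaTable.forPath(spark, "/tmp/delta/events")
val updates = Seq((2, "beta2"), (4, "delta")).toDF("id", "name")
// Upsert: update rows that match on id, insert the rest; the whole merge commits atomically.
target.as("t")
  .merge(updates.as("u"), "t.id = u.id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()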
Unified Batch and Stream Processing
A single table can serve as both source and target for batch and stream processing, supporting exactly-once semantics and simplifying ETL pipeline design.
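A minimal streaming sketch, assuming a Delta-enabled `spark` session; all paths are illustrative. The same table written by batch jobs is read here as a stream and copied to another Delta table with exactly-once semantics:
// Read the Delta table as a stream: new commits are picked up incrementally.
val events = spark.readStream.format("delta").load("/tmp/delta/events")
// Write the stream to another Delta table; the checkpoint location tracks progress
// so each input commit is processed exactly once.
val query = events.writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/delta/_checkpoints/events_copy")
  .outputMode("append")
  .start("/tmp/delta/events_copy")
// query.awaitTermination()  // block until the stream is stopped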
Performance Optimization
Techniques such as Compaction, Z-Ordering, and Data Skipping significantly improve query speed while keeping storage costs under control.
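A sketch of triggering these optimizations from Scala (Delta Lake 2.0+ is assumed; the path and the Z-Order column are illustrative):
import io.delta.tables.DeltaTable
val table = DeltaTable.forPath(spark, "/tmp/delta/events")
// Compaction: merge many small files into fewer, larger ones.
table.optimize().executeCompaction()
// Z-Ordering: co-locate rows by a frequently filtered column so data skipping
// can prune more files at query time.
table.optimize().executeZOrderBy("id")
// SQL equivalent: OPTIMIZE delta.`/tmp/delta/events` ZORDER BY (id)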
Delta Sharing
Supports a cross-organizational data sharing protocol, enabling secure sharing of Delta tables without copying data; clients exist for Apache Spark, Rust, Power BI, and other tools.
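On the recipient side, a shared table can be read from Spark roughly like this, assuming the delta-sharing-spark connector is on the classpath and the provider has issued the (hypothetical) profile file and share names below:
// Read a table exposed through Delta Sharing; no data is copied up front.
val shared = spark.read
  .format("deltaSharing")
  .load("/path/to/config.share#my_share.default.delta_events")
shared.show()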
4. Use Cases
- Auditing & Rollback: Leverage Time Travel to restore historical versions when data errors occur or to trace the source of changes (see the rollback sketch after this list).
- CDC / SCD Pipelines: MERGE operations combined with schema evolution to manage source system data changes.
- Real-time Analytics & Reporting: Ingest streaming and batch inputs into the same table to keep analytical reports updated in near real time.
- AI / ML Feature Engineering Platform: Through Delta-rs and Arrow integration, accelerate read/write performance for tools like DuckDB and Polars.
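For the auditing and rollback use case, here is a minimal rollback sketch; it assumes the delta_events table created in the practical example below and a Delta-enabled `spark` session:
// Roll the table back to an earlier, known-good version; the restore itself is
// committed as a new version, so the corrective action stays auditable.
spark.sql("RESTORE TABLE delta_events TO VERSION AS OF 1")
spark.sql("SELECT * FROM delta_events").show()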
5. Practical Examples
This section demonstrates how to write data using Spark with Delta Lake and query through Apache Doris's Delta Lake Catalog for integrated analysis.
Step 1: Create Delta Lake Table and Write Data (Spark)
import io.delta.tables._
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("DeltaDorisExample")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .master("local[*]")
  .getOrCreate()
// Create and write initial data
spark.sql("CREATE TABLE delta_events (id INT, name STRING) USING delta LOCATION '/tmp/delta/events'")
spark.sql("INSERT INTO delta_events VALUES (1, 'alpha'), (2, 'beta')")
Then insert another batch of new data and perform updates:
spark.sql("INSERT INTO delta_events VALUES (3, 'gamma')")
spark.sql("UPDATE delta_events SET name = 'beta2' WHERE id = 2")
Each of these operations commits a new table version (the CREATE TABLE, the two INSERTs, and the UPDATE), so the table now has versions 0 through 3.
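As an optional check, the commit history and an earlier version can be inspected directly from Spark (illustrative; this reuses the session and path from Step 1):
import io.delta.tables.DeltaTable
// List the commits recorded in the Delta Log for this table.
DeltaTable.forPath(spark, "/tmp/delta/events").history()
  .select("version", "operation", "timestamp")
  .show(truncate = false)
// Time travel back to the state right after the first insert.
spark.read.format("delta")
  .option("versionAsOf", 1)
  .load("/tmp/delta/events")
  .show()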
Step 2: Configure Delta Lake Catalog in Doris
CREATE CATALOG IF NOT EXISTS delta_lake_ctl PROPERTIES (
    'type' = 'trino-connector',
    'trino.connector.name' = 'delta_lake',
    'trino.hive.metastore' = 'thrift',
    'trino.hive.metastore.uri' = 'thrift://<hive-metastore-host>:9083',
    'trino.hive.config.resources' = '/path/core-site.xml,/path/hdfs-site.xml'
);
Once this catalog is established, you can access Delta Lake tables managed by the specified Hive Metastore.
Step 3: Query Delta Lake Tables in Doris
-- Switch catalog and database
SWITCH delta_lake_ctl;
USE default;
-- Query Delta table
SELECT * FROM delta_events LIMIT 10;
Fully qualified table names are also supported:
SELECT * FROM delta_lake_ctl.default.delta_events;
This lets Doris query the Delta table data written by Spark, automatically reading the latest table version.
Result Demonstration
- Spark writes produce a multi-version Delta table whose historical states can be read via time travel;
- Doris automatically reads the latest version of the table through the Delta Lake Catalog and supports standard SQL queries;
- Doris can join the Delta table with its own internal tables for combined analysis, without copying data out of Delta Lake.
6. Key Takeaways
- Delta Lake enhances data reliability and consistency on data lakes, providing full transaction support.
- It implements key features like unified batch-stream processing, time travel, schema management, and performance optimization.
- Suitable for scenarios including audit rollback, data change management, real-time analytics, and AI/ML feature pipelines.
- It competes with table formats like Iceberg and Hudi, but is expanding rapidly thanks to its ecosystem maturity, rich interfaces (such as the Rust API), and Databricks' large customer base.
- The open-source project is governed transparently, supports UniForm for format interoperability, and enables cross-engine collaboration.
7. FAQ
Q: What's the difference between Delta Lake and Iceberg?
A: Delta Lake is backed by Databricks and emphasizes ACID transactions, Time Travel, and real-time analytics; Iceberg is an Apache Software Foundation project that features decoupled metadata, supports complex partition evolution, and offers broader multi-engine compatibility.
Q: Can I use only OSS Delta without deploying on Databricks?
A: Yes. Delta Lake OSS can be used with Spark, Flink, Hive, Trino, Doris, and even PyTorch-based pipelines, without any Databricks deployment.