What is a Lakehouse?
1. Introduction
In the evolution of data architecture, the Data Warehouse and the Data Lake have respectively taken on the roles of structured analytics and large-scale raw data storage. However, with the rapid growth of data types, scale, and processing requirements, the boundary between the two has begun to blur, and running them side by side exposes numerous issues: data silos, redundant storage, complex ETL pipelines, and so on. The Lakehouse architecture emerged to bridge this divide.
Lakehouse (Data Lake + Data Warehouse) is a unified architecture that aims to provide data warehouse-level transactional capabilities, management capabilities, and query performance on top of a data lake foundation. It not only retains the low cost and flexibility of data lakes but also provides the consistency and high-performance analytical capabilities of data warehouses.
Lakehouse is rapidly becoming the new standard for big data infrastructure, widely adopted in modern enterprise data platform architectures.
2. Why Do We Need Lakehouse?
Pain Points of Traditional Architectures
- Complexity from multiple coexisting systems: heavy ETL/ELT processes lead to high data latency and error-prone operations.
- Data redundancy and consistency issues: Data is duplicated across warehouses and lakes, causing storage waste and version inconsistencies.
- Difficult cost control: Traditional warehouses are expensive, with costs skyrocketing when facing massive data volumes.
Improvements Brought by Lakehouse
- Unified storage format, eliminating data silos
- Introduction of ACID transactions, ensuring data consistency
- Support for efficient batch processing + real-time streaming
- Lower costs, stronger scalability
- Open ecosystem, compatible with various query engines
3. Lakehouse Architecture and Core Components
Overall Architecture Diagram
Below is an example of a Lakehouse architecture with Apache Doris as the query engine:
Core Components
- Open Table Format Layer
- Apache Iceberg: Supports schema evolution, snapshots, time travel, suitable for large-scale batch processing and streaming writes.
- Delta Lake: Developed by Databricks, emphasizes ACID transactions and unified streaming-batch processing.
- Apache Hudi: Focuses on near real-time data ingestion, suitable for Change Data Capture (CDC) scenarios.
- Query Engine Layer
- Apache Spark: The most commonly used Lakehouse compute engine, with native support for Iceberg/Delta/Hudi.
- Trino / Presto: Distributed SQL query engine that can directly read Lakehouse table formats.
- Apache Flink: Supports streaming writes and queries, suitable for real-time processing.
- Apache Doris: Supports accessing Iceberg/Delta/Hudi/Paimon data through Catalog federation, providing high-performance query services (see the catalog sketch after this list).
- DuckDB: Embedded OLAP database with excellent plugin extensibility, capable of accessing lake format data like Iceberg.
- Storage Layer
- Typical storage backends include object stores such as Amazon S3, Google Cloud Storage, and Alibaba Cloud OSS, as well as HDFS.
- Data files are typically in columnar storage formats like Parquet, ORC.
- Metadata and Transaction Management
- Iceberg/Delta/Hudi all provide specifications, interfaces, and internal implementations for metadata and transaction management. They support mechanisms like snapshots, transaction logs, manifest files, commit timelines, etc.
- Catalog is used to uniformly manage metadata for multiple tables, supporting cross-engine data access, permission control, and version tracking. Mainstream Catalog components include: AWS Glue, Unity Catalog, Polaris, Gravitino, etc.
- Consumption and Service Integration
- Deep integration with BI tools (Tableau, Superset), Notebooks (Jupyter), AI/ML workflows.
- Most table formats support DataFrame API, SQL API, REST Catalog, etc.
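To make the catalog-federation idea under the Query Engine Layer concrete, here is a minimal sketch in Apache Doris SQL. It assumes an Iceberg REST catalog; the catalog name, endpoint, database, and table names are placeholders, and real deployments typically also need object-storage credentials in the PROPERTIES.

-- Register an Iceberg REST catalog in Doris (names and URIs are illustrative)
CREATE CATALOG iceberg_lake PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",
    "uri" = "http://rest-catalog:8181"
);

-- Federated query: join a lake table with a table in Doris' internal catalog
SELECT o.order_id, o.amount, u.user_name
FROM iceberg_lake.sales.orders AS o
JOIN internal.dim.users AS u ON o.user_id = u.user_id;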
4. Key Features and Capabilities
ACID Transaction Support
Lakehouse supports consistent insert, update, and delete operations, relying on underlying transaction log mechanisms (such as Delta Lake's _delta_log and Iceberg's snapshot and manifest files).
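As an illustration, a single MERGE INTO statement applies inserts and updates in one atomic commit; readers see the table either before or after the commit, never in between, thanks to the table format's log/snapshot mechanism. This sketch uses Spark SQL against a Delta/Iceberg table; the table and column names are made up.

-- Upsert change records into a Lakehouse table in one atomic commit
MERGE INTO orders AS t
USING order_updates AS s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET t.status = s.status, t.amount = s.amount
WHEN NOT MATCHED THEN INSERT *;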
Time Travel and Version Control
Users can query historical data by snapshot ID or timestamp, for example:
-- Query the table as of June 1, 2024 (Spark SQL time-travel syntax)
SELECT * FROM user_logs TIMESTAMP AS OF '2024-06-01 00:00:00';
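The same mechanism also works with an explicit version or snapshot ID; in Spark SQL this looks roughly like the following (the version number is illustrative):

-- Query the table as of a specific version / snapshot ID
SELECT * FROM user_logs VERSION AS OF 42;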
Schema Evolution
Allows safe addition, deletion, and renaming of columns without rewriting the entire table. For example:
ALTER TABLE orders ADD COLUMN discount DOUBLE;
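Renaming and dropping columns are likewise metadata-only operations, though exact support depends on the table format and engine (Delta, for instance, requires column mapping to be enabled for renames). The column names below are illustrative:

ALTER TABLE orders RENAME COLUMN discount TO order_discount;
ALTER TABLE orders DROP COLUMN legacy_flag;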
Streaming + Batch Unification
Taking Hudi as an example, its writes support MOR (Merge-on-Read) mode, which, combined with Kafka, enables near real-time ingestion and consumption.
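A minimal Flink SQL sketch of this pattern: read events from Kafka and continuously upsert them into a Hudi MOR table. Connector options are abbreviated, and the topic, broker, and path values are placeholders.

-- Kafka source table
CREATE TABLE user_events_src (
    user_id    BIGINT,
    event_type STRING,
    event_time TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'user_events',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json',
    'scan.startup.mode' = 'latest-offset'
);

-- Hudi Merge-on-Read sink table
CREATE TABLE user_events_hudi (
    user_id    BIGINT,
    event_type STRING,
    event_time TIMESTAMP(3),
    PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
    'connector'  = 'hudi',
    'path'       = 's3a://my-bucket/lakehouse/user_events',
    'table.type' = 'MERGE_ON_READ'
);

-- Continuous streaming insert (upserts by primary key)
INSERT INTO user_events_hudi SELECT * FROM user_events_src;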
Multi-Engine Access
Query engines such as Trino, Presto, Spark, and Doris can all read the same Iceberg/Delta/Hudi data, greatly enhancing ecosystem compatibility.
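For example, once a table is registered in a shared catalog, different engines read the same underlying files; the catalog names below ("iceberg" in Trino, "lake" in Spark) are just assumed configuration:

-- Trino, with an Iceberg catalog configured as "iceberg"
SELECT count(*) FROM iceberg.sales.orders;

-- Spark SQL, with the same catalog registered as "lake"
SELECT count(*) FROM lake.sales.orders;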
5. Use Cases / Scenarios
Unified Storage and Compute for Big Data Platforms
Integrate enterprise data originally scattered across Hive, Kafka, ClickHouse, Doris, and other systems under the Lakehouse architecture.
Real-time Data Warehouse / Real-time Data Lake Updates
Combine Flink, Kafka, Hudi, or Delta Streaming to achieve near real-time data updates and analytics.
ML Feature Store and Versioning
Through Iceberg snapshots and metadata mechanisms, machine learning features can be managed with traceable version control.
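As a sketch of how this can look with Iceberg (it requires the Iceberg Spark SQL extensions and a reasonably recent Iceberg release; the table and tag names are illustrative), a training run can pin the exact snapshot of the features it consumed:

-- Tag the current snapshot of the feature table for a training run
ALTER TABLE lake.ml.user_features CREATE TAG training_2024_06;

-- Later, reproduce the run against exactly that version of the features
SELECT * FROM lake.ml.user_features VERSION AS OF 'training_2024_06';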
Query Acceleration + Multi-dimensional Analytics
Combined with engines like Trino and Doris, build ultra-fast query acceleration layers suitable for high-concurrency OLAP scenarios like advertising, recommendations, and risk control.
6. Practical Examples
Apache Doris + Iceberg + AWS S3 Tables Demo
See the accompanying video for the full walkthrough.
7. Key Takeaways
- Lakehouse merges the flexibility of data lakes with the strong consistency of data warehouses, representing the direction of modern data architecture.
- Supports ACID transactions, time travel, schema evolution, solving traditional lake-warehouse challenges.
- Open ecosystem: Iceberg, Delta, Hudi form the table format foundation of Lakehouse, compatible with mainstream compute engines.
- Supports unified real-time and batch processing, helping build end-to-end data platforms.
- Suitable for various business scenarios including data middle platforms, real-time data warehouses, and machine learning.
8. FAQ
Q: What are the main differences between Lakehouse and Data Warehouse?
A: A Lakehouse uses open data lake storage and table formats while providing transaction and version-control capabilities, whereas traditional warehouses rely on closed, proprietary storage tightly coupled to their compute.
Q: How should we choose between Iceberg, Delta, and Hudi?
A: Iceberg is suitable for batch processing of large tables; Delta fits Spark-centric scenarios; Hudi is better for CDC and frequent real-time updates.
Q: Will Lakehouse replace traditional data warehouses?
A: In many scenarios, Lakehouse has become a more flexible alternative, but traditional warehouses still have advantages in extreme performance OLAP scenarios.