Apache Hive

What is Apache Hive?

Apache Hive is a distributed, fault-tolerant data warehouse system built on Hadoop that supports reading, writing, and managing massive datasets (typically at petabyte scale) using HiveQL, an SQL-like language. As data volumes grew, enterprises increasingly wanted a familiar SQL interface for processing very large datasets; Hive emerged to meet exactly this need and substantially raised analyst productivity.

1. Current State, Evolution, and Development of Hive

Improvements Brought by Hive

Built on Hadoop with support for Tez or MapReduce execution engines, Hive dramatically lowered the barrier to large-scale data analysis, enabling data analysis teams to work without understanding the underlying MapReduce mechanisms.

  • Provides SQL-like interface HiveQL, reducing complexity and enabling SQL users to get started quickly.
  • Automatically compiles HiveQL into MapReduce or Tez jobs, hiding tedious implementation details.
  • Supports partitioning, bucketing, columnar storage formats (ORC, Parquet, etc.), and compression, significantly improving query performance.
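
    For example, a table can be declared with partitions, a columnar format, and compression in plain HiveQL (table and column names below are illustrative):

    CREATE TABLE web_logs (
      user_id     BIGINT,
      url         STRING,
      response_ms INT
    )
    PARTITIONED BY (log_date STRING)   -- one directory per date value, pruned at query time
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'SNAPPY');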

Limitations of Hive

  • Unsuitable for real-time processing / no OLTP support: Hive is designed for OLAP rather than online transaction processing, and does not support the low-latency writes or frequent updates that real-time scenarios require.
  • High query latency and noticeable performance bottlenecks: Hive queries typically run through the MapReduce or Tez engines, resulting in long response times; Hive also lacks efficient indexing mechanisms and has limited subquery support.
  • Complex and costly update/delete operations: Although Hive supports ACID transactions, updates and deletes are expensive, requiring delta files and background compaction for transactional tables, or full partition/table overwrites for non-transactional ones.
  • Insufficient cloud object storage support: The Hive Metastore was designed primarily for HDFS and does not adapt well to cloud object storage, with limitations in performance and flexibility.
  • Limited syntax and functionality: HiveQL does not fully implement the SQL standard, lacks stored procedures and triggers, and offers only limited indexing, so optimization relies on partitioning and bucketing.

Evolution Relationship Between Hive and Lake Formats like Iceberg and Hudi

  • Data lake format background: As businesses demanded higher concurrent writes, schema evolution, and transaction support, legacy Hive formats proved inadequate, driving the emergence of open-source projects like Iceberg and Hudi.
  • Apache Hudi: Initiated by Uber, supports Copy-On-Write and Merge-On-Read modes, optimizing real-time or incremental write scenarios, suitable for near-real-time data lake update requirements (such as CDC and streaming writes).
  • Apache Iceberg: Proposed by Netflix, addresses Hive's transactionality, safe concurrent writes, and query planning efficiency issues, supporting ACID transactions, schema evolution, hidden partitioning, and multi-engine compatibility (Spark, Flink, Trino, Doris, etc.).
  • Evolution trends: Hudi focuses more on streaming and update scenarios, while Iceberg concentrates on analytical workloads and multi-engine governance scenarios. These formats have become important cornerstones of modern data lake/lakehouse architectures.

Hive Metastore

  • De facto industry standard catalog service: Although Hive table formats and query engines are gradually being replaced, Hive Metastore (HMS) as a metadata storage system remains widely shared by tools like Iceberg, Hudi, Spark, Trino, and Doris.
  • Broad compatibility and maturity: The vast majority of Iceberg deployments still use Hive Metastore as the metadata control plane, capable of managing table structures, partitions, snapshot information, etc., serving as a core component of current metadata management platforms.
  • Many enterprises still rely on Hive storage and catalogs: Numerous companies store massive data as Hive table structures, and dependency on Hive Metastore hasn't disappeared.
  • Gradual migration of query and storage patterns: While Hive query engines are still used in batch ETL and reporting scenarios, more enterprises combine Iceberg or Hudi table formats with Spark SQL, Trino, and Doris engines to replace the Hive execution layer.
  • Future role transformation: Hive will increasingly serve as a catalog and compatibility layer rather than the primary analytics platform, with Metastore continuing to serve the lakehouse ecosystem while computation tasks like queries and updates are primarily handled by new technologies.

2. Apache Hive Architecture and Core Components

The overall architectural logic is as follows:

  • Client → HiveServer2 → Compiler/Optimizer → Execution Engine (Tez/MapReduce/Spark/MR3) → Storage Layer (HDFS or cloud storage)
  • Metadata is managed through Metastore, with HCatalog and WebHCat providing cross-tool access capabilities.

Core Components

Hive Client (Hive Client / Beeline)

  • Definition: The interfaces through which users or applications submit queries, including JDBC/ODBC drivers, Thrift clients, and the Beeline command-line shell.
  • Function: Provides command-line or API interfaces to connect to HiveServer2.
  • Relationship: Sends SQL requests to HiveServer2.

HiveServer2 (HS2)

  • Definition: Core service handling concurrent multi-client requests.
  • Function: Receives requests, creates sessions, invokes the compiler to generate execution plans, and submits jobs to the cluster (typically via YARN). Supports authentication and access control.
  • Relationship: Client → HS2 → Compiler/Execution Engine.

Hive Compiler + Optimizer

  • Definition: Converts HiveQL statements into execution plans.
  • Function: Parses syntax, performs semantic checks, generates a logical plan, applies optimizations (such as predicate pushdown and join reordering), and finally produces a DAG-based physical plan.
  • Relationship: Uses Metastore for metadata information, outputs to execution engine.
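
    The plan produced by the compiler and optimizer can be inspected with EXPLAIN (the orders table is illustrative):

    EXPLAIN
    SELECT customer_id, COUNT(*) AS order_cnt
    FROM orders
    WHERE order_date >= '2025-01-01'
    GROUP BY customer_id;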

Execution Engine

  • Definition: Runtime environment that actually executes Hive query tasks, including MapReduce, Tez, Spark, MR3, etc.
  • Function: Based on the DAG output from the compiler, schedules tasks to run on Hadoop clusters or Kubernetes environments.
  • Relationship: Consumes jobs generated by the compiler, reads storage layer data, and returns execution results.
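
    The engine is chosen per session via configuration; a minimal sketch (available engines and the default depend on the Hive version and installation):

    SET hive.execution.engine=tez;   -- commonly accepted values include mr, tez, and spark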

Metastore (Hive Metastore)

  • Definition: Hive's central metadata repository, typically running on relational databases (MySQL, PostgreSQL, or Derby).
  • Function: Stores table definitions, column information, partitions, serialization information, data locations, etc.; provides Thrift interface for client queries.
  • Relationship: Critical dependency, accessed by HiveServer2, Compiler, HCatalog, and third-party tools.
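
    Metadata held in the Metastore can be inspected directly from HiveQL (the orders table is illustrative):

    SHOW TABLES;
    DESCRIBE FORMATTED orders;   -- columns, storage format, data location, table properties
    SHOW PARTITIONS orders;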

HCatalog and WebHCat

  • HCatalog: Hive metadata layer providing consistent table structure access for Pig, MapReduce, Spark, etc.
  • WebHCat: REST-based HTTP interface providing metadata operations and job triggering services for Hive, Pig, and MapReduce.

3. Key Features and Capabilities

HiveQL (SQL-like Query Language)

  • Supports SQL syntax including SELECT, JOIN, GROUP BY, window functions, etc.; users familiar with SQL can quickly get started with Hive queries.

    Example:

    SELECT customer_id, COUNT(*) AS order_cnt
    FROM orders
    WHERE order_date >= '2025-01-01'
    GROUP BY customer_id;
    

Table Format and Compression Support

  • Supports multiple formats including TEXTFILE, ORC, Parquet, Avro, etc., configurable with compression algorithms (Snappy, Deflate, etc.) to improve storage and query efficiency.
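
    Example of a Parquet table with Snappy compression set through table properties (exact property keys can vary between Hive versions; table and column names are illustrative):

    CREATE TABLE page_views (
      view_id   BIGINT,
      page_url  STRING,
      view_time TIMESTAMP
    )
    STORED AS PARQUET
    TBLPROPERTIES ('parquet.compression' = 'SNAPPY');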

Partitioning and Bucketing

  • Partitioning divides data into directories by column values, reducing the amount of data scanned; bucketing hashes rows into a fixed number of files per partition, optimizing join and sampling performance.
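
    Example combining both techniques: the table is partitioned by date, and rows within each partition are hash-bucketed on user_id (names are illustrative):

    CREATE TABLE user_events (
      user_id    BIGINT,
      event_type STRING,
      payload    STRING
    )
    PARTITIONED BY (event_date STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS
    STORED AS ORC;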

ACID Transactions and Data Compaction

  • Supports ACID transactions for ORC format, enabling insert, update, and delete operations with automatic background data compaction to ensure consistency and performance.
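
    A minimal sketch of a transactional table (ACID support must be enabled on the cluster, and full ACID tables must be stored as ORC; names are illustrative):

    CREATE TABLE orders_acid (
      order_id    BIGINT,
      customer_id BIGINT,
      status      STRING
    )
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

    UPDATE orders_acid SET status = 'shipped' WHERE order_id = 1001;
    DELETE FROM orders_acid WHERE status = 'cancelled';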

LLAP (Low Latency Analytical Processing)

  • Provides sub-second interactive query capabilities through in-memory caching and long-running daemon processes, avoiding per-query container startup overhead.
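
    LLAP requires a running daemon cluster; with one available, queries can be directed to it through session settings (configuration keys and defaults vary by Hive version and distribution, so treat the sketch below as an assumption to verify):

    SET hive.execution.engine=tez;
    SET hive.llap.execution.mode=all;   -- run all query fragments on LLAP daemons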

Security and Auditing

  • Integrates with Kerberos, Apache Ranger, and Atlas to implement authentication, authorization, metadata auditing, and observability.

4. Use Cases

Data Warehouse and ETL

  • Extract raw log data from HDFS/S3, transform it into structured tables via HiveQL for BI tool analysis.
  • Commonly used for batch analysis tasks, scheduled report generation, etc.
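
    A typical batch ETL step, sketched with illustrative table names (dynamic partitioning usually has to be enabled first):

    SET hive.exec.dynamic.partition.mode=nonstrict;

    INSERT OVERWRITE TABLE clean_logs PARTITION (log_date)
    SELECT user_id, url, response_ms, log_date
    FROM raw_logs
    WHERE response_ms IS NOT NULL;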

Cloud Data Lake Integration

  • The metastore can be backed by a self-managed MySQL database or replaced with the AWS Glue Data Catalog, working with EMR to analyze data stored on S3.
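
    A sketch of an external table over data already in S3 (the bucket path is illustrative; the s3:// scheme is available on EMR, while self-managed Hadoop deployments typically use s3a://):

    CREATE EXTERNAL TABLE s3_events (
      event_id BIGINT,
      payload  STRING
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION 's3://my-data-lake/events/';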

Integration with Iceberg or Hudi

  • Recent Hive versions support reading and writing Iceberg tables (through the StorageHandler interface), making Hive compatible with modern lakehouse architectures.
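
    A hedged sketch of creating an Iceberg table from Hive through the storage handler (assumes the Iceberg runtime jar is on the classpath; newer Hive releases also accept a shorter STORED BY ICEBERG form):

    CREATE TABLE iceberg_orders (
      order_id    BIGINT,
      customer_id BIGINT,
      amount      DOUBLE
    )
    STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';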

5. Practical Examples

Through Apache Doris's Hive Catalog functionality, you can query and write to Hive tables, achieving Zero-ETL or data writeback operations.

Creating Hive Catalog

CREATE CATALOG hive_catalog PROPERTIES (
  'type' = 'hms',
  'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
  'fs.defaultFS' = 'hdfs://namenode:9000'
);

Querying Hive Tables

SWITCH hive_catalog;

SELECT customer_id, SUM(amount) AS total_spent
FROM sales_db.sales_orc
WHERE order_date = '2025-07-30'
GROUP BY customer_id;

Doris will automatically synchronize table structures and metadata through Hive Catalog, generate execution plans, and directly read Hive ORC or Parquet files.

Writing Query Results to Doris Internal or Hive Tables

Writing to Doris Internal Tables

CREATE TABLE doris_sales_total (
  customer_id BIGINT,
  total_spent DOUBLE
)
DUPLICATE KEY(customer_id) DISTRIBUTED BY HASH(customer_id) BUCKETS 10;

INSERT INTO doris_sales_total
SELECT customer_id, SUM(amount)
FROM hive_catalog.sales_db.sales_orc
WHERE order_date = '2025-07-30'
GROUP BY customer_id;

Writing Back to Hive Tables

If the Hive table is writable, results can be written back to Hive as in the following example:

CREATE TABLE hive_catalog.sales_db.sales_total_orc
(
  customer_id BIGINT,
  total_spent DOUBLE
) ENGINE=hive
PROPERTIES (
  'file_format'='orc'
);

INSERT INTO hive_catalog.sales_db.sales_total_orc
SELECT customer_id, SUM(amount)
FROM hive_catalog.sales_db.sales_orc
GROUP BY customer_id;

6. Key Takeaways

  • Hive provides SQL-like interface (HiveQL) and Hadoop ecosystem compatibility, suitable for batch analysis, ETL, and data warehouse scenarios.
  • Supports partitioning, columnar storage (like ORC/Parquet), ACID transactions, and LLAP interactive queries, improving performance and reliability under traditional Hive table formats.
  • Hive's limitations are significant: high query latency, unsuitability for real-time writes and updates, partition schemes fixed at table definition (no hidden partitioning or partition evolution), limited SQL standard compliance, costly update/delete operations, and inadequate support for cloud object storage.
  • As requirements evolved, Apache Hudi (streaming writes and real-time updates) and Apache Iceberg (transactionality, snapshot isolation, schema/partition evolution, multi-engine support) emerged to address Hive table format shortcomings. These lakehouse formats have become mainstream choices.
  • Despite table format evolution, Hive Metastore (HMS) remains the de facto standard for metadata management, widely shared and compatible across ecosystems including Iceberg, Hudi, Spark, and Trino.
  • Currently, massive amounts of enterprise data are still stored as Hive tables, and the dependency on the Hive catalog has not gone away; however, the query execution layer is gradually being replaced by Iceberg/Hudi table formats and modern engines (such as Spark SQL, Trino, and Doris), with Hive transforming into a metadata catalog and compatibility layer.

7. Additional Resources and Next Steps