ORC Format

VeloDB Engineering Team · 2025/09/03

1. What is Apache ORC

Apache ORC (Optimized Row Columnar) is an open-source columnar storage format optimized for large-scale data storage and analytics. Developed by Hortonworks in 2013, it has become an Apache top-level project and is widely used in big data ecosystems including Apache Hive, Spark, Presto, Trino, and more.

As the native storage format for the Hive ecosystem, ORC has deep roots in the Hadoop ecosystem. It inherits the advantages of RCFile while making significant improvements in compression efficiency, query performance, and storage optimization, making it a crucial choice for enterprise-grade data warehouses.

ORC not only provides excellent compression ratios and query performance but also includes rich built-in statistics and indexing mechanisms, making it one of the core storage formats in modern data lake and data warehouse architectures.

2. Why Apache ORC is Needed

In the big data processing domain, traditional row storage formats have numerous limitations, and ORC's design goals are specifically to address these issues:

  • Low storage efficiency: Traditional formats have poor compression ratios, leading to high storage costs
  • Query performance bottlenecks: Unable to effectively leverage columnar storage advantages for query optimization
  • Lack of intelligent indexing: No built-in statistics and indexing mechanisms
  • Insufficient ACID support: Lack of transaction-level data consistency guarantees

Row Storage vs. Columnar Storage: ORC's Optimization

Consider a sales data table:

Traditional Row Storage Issues

1001,C001,Laptop,999.0,1,2024-01-01 09:30:00
1002,C002,Mouse,25.5,2,2024-01-01 10:15:30
1003,C001,Keyboard,89.9,1,2024-01-01 11:22:45

When executing a query like SELECT SUM(price) FROM sales WHERE order_date > '2024-01-01', the engine must read every row in full, including fields the query never uses, such as customer_id and product_name.

ORC Columnar Storage Advantages

ORC organizes the same data by columns (a query example follows this list):

  • order_id column: [1001, 1002, 1003]
  • price column: [999.0, 25.5, 89.9]
  • order_date column: [2024-01-01 09:30:00, ...]
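
With this layout, the aggregation from the earlier example needs to touch only two of the six columns:

-- Only the price and order_date column streams are read;
-- order_id, customer_id, product_name, and quantity are never touched.
SELECT SUM(price)
FROM sales
WHERE order_date > '2024-01-01';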

3. ORC Architecture and Core Components

File Structure Hierarchy

1. File Footer

  • Contains file metadata such as the schema, compression type, and statistics
  • Stores the location and size of every Stripe
  • Readers start from the file tail (Postscript and Footer) to quickly obtain the file structure

2. Stripe

  • Basic storage unit of ORC files, similar to Parquet's Row Group
  • Default size is 64MB, configurable
  • Each Stripe holds a large batch of rows and can be read in parallel

3. Column Data

  • Column-stored data blocks within Stripes
  • Each column is independently compressed and encoded
  • Supports multiple encoding methods: dictionary encoding, RLE, Delta encoding, etc.

4. Index Data

  • Contains Row Group Index and Bloom Filter
  • Provides fine-grained data filtering capabilities
  • Supports skipping unnecessary data block reads

Component Relationship Summary (a tuning sketch follows this list):

  • One ORC File → Multiple Stripes
  • Each Stripe → Multiple Column Data + Index Data
  • Each Column Data → Multiple Row Groups (default 10,000 rows)
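
These structural parameters are exposed as ORC table properties in Hive. A minimal tuning sketch; the table and columns are illustrative, and the values shown simply restate the defaults:

-- 64 MB stripes (value in bytes) and one row-index entry per 10,000 rows
CREATE TABLE events_orc (
    event_id BIGINT,
    payload STRING
)
STORED AS ORC
TBLPROPERTIES (
    'orc.stripe.size'='67108864',
    'orc.row.index.stride'='10000'
);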

4. Key Features & Characteristics

Efficient Columnar Storage and Compression

ORC employs advanced compression and encoding techniques (a codec-selection example follows this list):

  • Multi-level compression: Supports ZLIB, Snappy, LZ4, ZSTD compression algorithms
  • Smart encoding: Dictionary encoding, Run Length Encoding (RLE), Delta encoding, bit packing, etc.
  • Adaptive compression: Automatically selects optimal encoding based on data characteristics
  • Excellent compression ratio: Typically saves 75%+ storage space compared to uncompressed text formats
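
The codec is selected per table through the orc.compress property. A minimal sketch, assuming the sales_orc table from the examples below (ZSTD requires a reasonably recent ORC version); the change affects only files written afterwards:

-- Newly written files use ZSTD; existing files keep their original codec
ALTER TABLE sales_orc SET TBLPROPERTIES ('orc.compress'='ZSTD');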

ACID Transaction Support

ORC is the storage foundation for Hive's ACID features (a DML example follows this list):

  • Atomicity: Supports transaction-level data writes
  • Consistency: Ensures consistent data states
  • Isolation: Supports Multi-Version Concurrency Control (MVCC)
  • Durability: Ensures persistence of committed transactions
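
On a transactional ORC table, such as the sales_acid table created in the Practical Examples section below, Hive accepts row-level DML directly:

-- Row-level modifications on an ACID ORC table
UPDATE sales_acid SET amount = amount * 0.9 WHERE customer_id = 'C001';
DELETE FROM sales_acid WHERE order_date < '2020-01-01';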

Built-in Statistics and Indexing

ORC files include rich built-in statistics (see the pushdown example after this list):

  • Column statistics: min/max values, null counts, distinct value estimates
  • Bloom Filter: Quick determination of value existence, reducing unnecessary disk reads
  • Row Group Index: An index entry is written every 10,000 rows (the row index stride), enabling fast skipping
  • Predicate Pushdown: Push filter conditions down to the storage layer
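
In Hive, these mechanisms are driven by predicate pushdown settings. A minimal sketch against the sales_orc table defined below, which carries a Bloom Filter on customer_id:

SET hive.optimize.ppd=true;           -- push predicates down to the reader
SET hive.optimize.index.filter=true;  -- use ORC row-group indexes and Bloom Filters

-- min/max statistics plus the Bloom Filter let ORC skip every
-- row group that cannot contain this customer_id
SELECT COUNT(*) FROM sales_orc WHERE customer_id = 'C001';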

Schema Evolution Support

  • Add columns: New columns can be appended to the schema without rewriting existing data files
  • Remove columns: Ignore unnecessary columns through schema projection
  • Rename columns: Support column name changes (through position mapping)
  • Type promotion: Support compatible type conversions (e.g., INT → BIGINT)
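
In Hive, these evolutions are ordinary DDL statements and never rewrite existing ORC files. A minimal sketch against the sales_orc table defined below:

-- Append a new column; older files simply return NULL for it
ALTER TABLE sales_orc ADD COLUMNS (discount DECIMAL(5,2));

-- Compatible type promotion: INT -> BIGINT
ALTER TABLE sales_orc CHANGE COLUMN quantity quantity BIGINT;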

5. Use Cases / Application Scenarios

Enterprise Data Warehouses

ORC has significant advantages in enterprise data warehouses:

  • Deep Hive ecosystem integration: As the Hive ecosystem's primary storage format, it works seamlessly with the Hive Metastore and HiveQL
  • Enterprise-grade features: Supports ACID transactions, row-level updates/deletes, meeting traditional data warehouse requirements
  • High compression ratio: Significantly reduces storage costs, especially suitable for long-term data archiving
  • Mature and stable: Long-term validation in Hadoop ecosystem, lower enterprise adoption risk

Batch Processing ETL Scenarios

In large-scale ETL processing, ORC delivers excellent performance (a load example follows this list):

  • High-throughput writes: Optimized write paths supporting large batch data imports
  • Incremental updates: Combined with Hive ACID, supports INSERT, UPDATE, DELETE operations
  • Partition support: Tight integration with Hive partitioned tables, including partition-pruning optimization
  • Compressed transmission: Reduces network transmission overhead, accelerates distributed computing
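
A typical batch-ETL pattern is a dynamic-partition load into a partitioned ORC table. A minimal sketch; sales_by_day and staging_sales are illustrative names:

SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE sales_by_day (
    order_id BIGINT,
    customer_id STRING,
    amount DECIMAL(10,2)
)
PARTITIONED BY (order_date DATE)
STORED AS ORC;

-- Each day lands in its own partition, so later queries can prune
-- down to exactly the dates they need
INSERT OVERWRITE TABLE sales_by_day PARTITION (order_date)
SELECT order_id, customer_id, amount, order_date
FROM staging_sales;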

OLAP Analytical Queries

ORC's design is optimized specifically for analytical queries (a vectorization example follows this list):

  • Column pruning: Only read columns involved in queries, significantly reducing I/O
  • Predicate pushdown: Utilize statistics and Bloom Filters for fast data filtering
  • Vectorized execution: Support SIMD instructions, improving CPU utilization
  • Parallel reading: Stripe-level parallel processing, fully utilizing multi-core resources
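
In Hive, vectorized execution over ORC is a single switch. A minimal sketch, reusing the illustrative sales_by_day table from the ETL example above:

SET hive.vectorized.execution.enabled=true;

-- Rows are processed in batches (1,024 by default) instead of one
-- at a time, which is where the columnar layout pays off
SELECT customer_id, SUM(amount) AS total_amount
FROM sales_by_day
GROUP BY customer_id;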

Integration with Table Formats

While ORC is powerful on its own, in modern data lake architectures it is often paired with open table formats such as Apache Iceberg and Apache Hudi, which layer snapshot isolation, time travel, and richer schema evolution on top of ORC data files.

6. Practical Examples

Reading ORC Format Data in Doris

SELECT * FROM S3 (
    'uri' = 's3://bucket/path/to/data.orc',
    'format' = 'orc',
    's3.endpoint' = 'https://s3.us-east-1.amazonaws.com',
    's3.region' = 'us-east-1',
    's3.access_key' = 'ak',
    's3.secret_key' = 'sk'
);

Exporting Data as ORC Format in Doris

SELECT customer_id, SUM(amount) as total_amount
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
INTO OUTFILE "s3://bucket/result_"
FORMAT AS ORC
PROPERTIES (
    's3.endpoint' = 'https://s3.us-east-1.amazonaws.com',
    's3.region' = 'us-east-1',
    's3.access_key' = 'ak',
    's3.secret_key' = 'sk',
    'orc.compress' = 'ZSTD'
);

Creating ORC Tables in Hive

CREATE TABLE sales_orc (
    order_id BIGINT,
    customer_id STRING,
    product_name STRING,
    price DECIMAL(10,2),
    quantity INT,
    order_date TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES (
    'orc.compress'='ZLIB',
    'orc.bloom.filter.columns'='customer_id,product_name'
);

Enabling ACID Transaction Support

-- Enable Hive ACID transactions
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Create ACID-enabled ORC table
CREATE TABLE sales_acid (
    order_id BIGINT,
    customer_id STRING,
    amount DECIMAL(10,2),
    order_date DATE
)
CLUSTERED BY (order_id) INTO 10 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

7. Key Takeaways

  • Apache ORC is a columnar storage format optimized specifically for the Hadoop ecosystem
  • Provides excellent compression ratios and query performance
  • Built-in rich statistics and indexing mechanisms
  • Native ACID transaction support, suitable for enterprise data warehouses
  • Deep integration with Hive ecosystem, mature and stable
  • Excellent performance in batch ETL and OLAP analytical scenarios

8. FAQ

Q: What are the differences between ORC and Parquet?

A: Both are columnar formats, but they have different strengths:

  • ORC: Stronger support in the Hive ecosystem, built-in ACID transactions, and often higher compression ratios
  • Parquet: Broader ecosystem, better Spark integration, stronger cross-platform compatibility

Q: Does ORC support nested data types?

A: Yes. ORC supports complex data types (see the example after this list), including:

  • STRUCT (structures)
  • ARRAY (arrays)
  • MAP (mappings)
  • UNION (union types)
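
For example, a single Hive table can combine several of these (UNION types exist in Hive as UNIONTYPE but are rarely used); user_profiles is an illustrative name:

CREATE TABLE user_profiles (
    user_id BIGINT,
    address STRUCT<city:STRING, zip:STRING>,
    tags ARRAY<STRING>,
    attributes MAP<STRING, STRING>
)
STORED AS ORC;

-- Nested fields are stored as columns too, so this reads only
-- the address.city stream
SELECT address.city FROM user_profiles;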

Q: How to choose ORC compression algorithms?

A: Recommendations based on scenarios:

  • ZLIB: Balanced compression ratio and speed, default choice
  • Snappy: Fast compression/decompression speed, medium compression ratio
  • LZ4: Extremely fast decompression speed, suitable for frequently accessed data
  • ZSTD: Best compression ratio, suitable for archived data

Q: Are ORC files compatible across different versions?

A: ORC is backward compatible; newer versions can read older version files. However, it's recommended to:

  • Maintain ORC version consistency
  • Use schema evolution features for structural changes
  • Test compatibility in production environments