1. What is Apache ORC
Apache ORC (Optimized Row Columnar) is an open-source columnar storage format designed for large-scale data storage and analytics. Created at Hortonworks in 2013, it became an Apache top-level project in 2015 and is widely used across big data ecosystems including Apache Hive, Spark, Presto, Trino, and more.
As the native storage format for the Hive ecosystem, ORC has deep roots in the Hadoop ecosystem. It inherits the advantages of RCFile while making significant improvements in compression efficiency, query performance, and storage optimization, making it a crucial choice for enterprise-grade data warehouses.
ORC not only provides excellent compression ratios and query performance but also includes rich built-in statistics and indexing mechanisms, making it one of the core storage formats in modern data lake and data warehouse architectures.
2. Why Apache ORC is Needed
In the big data processing domain, traditional row storage formats have numerous limitations, and ORC's design goals are specifically to address these issues:
- Low storage efficiency: Traditional formats have poor compression ratios, leading to high storage costs
- Query performance bottlenecks: Unable to effectively leverage columnar storage advantages for query optimization
- Lack of intelligent indexing: No built-in statistics and indexing mechanisms
- Insufficient ACID support: Lack of transaction-level data consistency guarantees
Row Storage vs. Columnar Storage: How ORC Optimizes
Consider a sales data table:
Traditional Row Storage Issues
1001,C001,Laptop,999.0,1,2024-01-01 09:30:00
1002,C002,Mouse,25.5,2,2024-01-01 10:15:30
1003,C001,Keyboard,89.9,1,2024-01-01 11:22:45
When executing a query like SELECT SUM(price) FROM sales WHERE order_date > '2024-01-01', the engine must read each row in full, including fields the query never uses, such as customer_id and product_name.
ORC Columnar Storage Advantages
ORC organizes data by columns:
order_id column:   [1001, 1002, 1003]
price column:      [999.0, 25.5, 89.9]
order_date column: [2024-01-01 09:30:00, ...]
The same SUM(price) query now reads only the price and order_date columns; every other column in the file is skipped entirely.
3. ORC Architecture and Core Components
File Structure Hierarchy
1. File Footer
- Contains file metadata information such as schema, compression types, statistics, etc.
- Stores location and size information for all Stripes
- File reading starts from the Footer to quickly obtain file structure
2. Stripe
- Basic storage unit of ORC files, similar to Parquet's Row Group
- Default size is 64MB, configurable
- Each Stripe holds many rows of data and can be read in parallel
3. Column Data
- Column-stored data blocks within Stripes
- Each column is independently compressed and encoded
- Supports multiple encoding methods: dictionary encoding, RLE, Delta encoding, etc.
4. Index Data
- Contains Row Group Index and Bloom Filter
- Provides fine-grained data filtering capabilities
- Supports skipping unnecessary data block reads
Component Relationship Summary:
- One ORC File → Multiple Stripes
- Each Stripe → Multiple Column Data + Index Data
- Each Column Data → Multiple Row Groups (default 10,000 rows)
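To make the structure concrete, the Hive sketch below pins these knobs explicitly through the standard orc.* table properties (the table and columns are illustrative; the values shown are the usual defaults):
CREATE TABLE events_orc (
    event_id BIGINT,
    event_time TIMESTAMP,
    payload STRING
)
STORED AS ORC
TBLPROPERTIES (
    'orc.stripe.size'='67108864',     -- 64 MB Stripes
    'orc.row.index.stride'='10000',   -- one row-group index entry per 10,000 rows
    'orc.create.index'='true'         -- write row-group indexes
);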
4. Key Features & Characteristics
Efficient Columnar Storage and Compression
ORC employs advanced compression and encoding techniques:
- Multi-level compression: Supports ZLIB, Snappy, LZ4, ZSTD compression algorithms
- Smart encoding: Dictionary encoding, Run Length Encoding (RLE), Delta encoding, bit packing, etc.
- Adaptive compression: Automatically selects optimal encoding based on data characteristics
- Excellent compression ratio: Typically saves 75%+ storage space compared to uncompressed text formats
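Codec and encoding behavior can likewise be tuned per table. A brief sketch (the threshold value is illustrative; orc.dictionary.key.threshold governs when a string column falls back from dictionary encoding to direct encoding):
CREATE TABLE logs_orc (
    log_id BIGINT,
    level STRING,
    message STRING
)
STORED AS ORC
TBLPROPERTIES (
    'orc.compress'='ZSTD',                  -- compression codec
    'orc.dictionary.key.threshold'='0.8'    -- dictionary-encode only if distinct/total <= 0.8
);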
ACID Transaction Support
ORC is the foundation for Hive ACID features:
- Atomicity: Supports transaction-level data writes
- Consistency: Ensures consistent data states
- Isolation: Supports Multi-Version Concurrency Control (MVCC)
- Durability: Ensures persistence of committed transactions
Built-in Statistics and Indexing
ORC files include rich built-in statistics:
- Column statistics: min/max values, null counts, and sums, maintained at file, Stripe, and row-group granularity
- Bloom Filter: Quick determination of value existence, reducing unnecessary disk reads
- Row Group Index: Creates indexes every 10,000 rows, supports fast skipping
- Predicate Pushdown: Push filter conditions down to the storage layer
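To see how these pieces cooperate, consider the sales_orc table created in Section 6, which declares a Bloom Filter on customer_id. For a query like the sketch below, the reader first consults row-group min/max statistics to skip row groups outside the date range, then probes the Bloom Filter to skip row groups that cannot contain the requested customer:
SELECT order_id, price
FROM sales_orc
WHERE order_date >= '2024-02-01'   -- row groups skipped via min/max statistics
  AND customer_id = 'C001';        -- row groups skipped via the Bloom Filter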
Schema Evolution Support
- Add columns: Can add new columns at the end of files without affecting existing data
- Remove columns: Ignore unnecessary columns through schema projection
- Rename columns: Support column name changes (through position mapping)
- Type promotion: Support compatible type conversions (e.g., INT → BIGINT)
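In Hive, these evolutions are ordinary DDL statements against an ORC table; a minimal sketch using the sales_orc table from Section 6 (the discount column is illustrative):
-- Append a new column; older files simply return NULL for it
ALTER TABLE sales_orc ADD COLUMNS (discount DECIMAL(10,2));
-- Compatible type promotion: INT -> BIGINT
ALTER TABLE sales_orc CHANGE COLUMN quantity quantity BIGINT;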
5. Use Cases / Application Scenarios
Enterprise Data Warehouses
ORC has significant advantages in enterprise data warehouses:
- Deep Hive ecosystem integration: As Hive's default storage format, seamlessly works with Hive MetaStore and HiveQL
- Enterprise-grade features: Supports ACID transactions, row-level updates/deletes, meeting traditional data warehouse requirements
- High compression ratio: Significantly reduces storage costs, especially suitable for long-term data archiving
- Mature and stable: Long-term validation in Hadoop ecosystem, lower enterprise adoption risk
Batch Processing ETL Scenarios
In large-scale ETL processing, ORC provides excellent performance:
- High-throughput writes: Optimized write paths supporting large batch data imports
- Incremental updates: Combined with Hive ACID, supports INSERT, UPDATE, DELETE operations
- Partition support: Perfect integration with Hive partitioned tables, supports partition pruning optimization
- Compressed transmission: Reduces network transmission overhead, accelerates distributed computing
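A minimal sketch of the partitioning pattern, assuming a daily-partitioned table (names are illustrative): only matching partition directories are scanned, and within them ORC's Stripe statistics prune further.
CREATE TABLE sales_daily (
    order_id BIGINT,
    customer_id STRING,
    amount DECIMAL(10,2)
)
PARTITIONED BY (dt STRING)
STORED AS ORC;
-- Partition pruning: only the dt='2024-01-01' directory is read
SELECT SUM(amount) FROM sales_daily WHERE dt = '2024-01-01';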
OLAP Analytical Queries
ORC's design is specifically optimized for analytical queries:
- Column pruning: Only read columns involved in queries, significantly reducing I/O
- Predicate pushdown: Utilize statistics and Bloom Filters for fast data filtering
- Vectorized execution: Support SIMD instructions, improving CPU utilization
- Parallel reading: Stripe-level parallel processing, fully utilizing multi-core resources
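In Hive, vectorized reads over ORC are controlled by a session flag; combined with column pruning, the aggregate below touches only the customer_id, price, and quantity columns of the sales_orc table from Section 6:
SET hive.vectorized.execution.enabled = true;
SELECT customer_id, SUM(price * quantity) AS revenue
FROM sales_orc
GROUP BY customer_id;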
Integration with Table Formats
While ORC is powerful on its own, in modern data lake architectures it is often paired with table formats such as Apache Iceberg and Apache Hudi, which use ORC as the underlying file format while layering on table-level features such as snapshots, schema management, and transactions across files.
6. Practical Examples
Reading ORC Format Data in Doris
SELECT * FROM S3 (
    'uri' = 's3://bucket/path/to/data.orc',
    'format' = 'orc',
    's3.endpoint' = 'https://s3.us-east-1.amazonaws.com',
    's3.region' = 'us-east-1',
    's3.access_key' = 'ak',
    's3.secret_key' = 'sk'
);
Exporting Data as ORC Format in Doris
SELECT customer_id, SUM(amount) AS total_amount
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
INTO OUTFILE "s3://bucket/result_"
FORMAT AS ORC
PROPERTIES (
    's3.endpoint' = 'https://s3.us-east-1.amazonaws.com',
    's3.region' = 'us-east-1',
    's3.access_key' = 'ak',
    's3.secret_key' = 'sk',
    'orc.compress' = 'ZSTD'
);
Creating ORC Tables in Hive
CREATE TABLE sales_orc (
    order_id BIGINT,
    customer_id STRING,
    product_name STRING,
    price DECIMAL(10,2),
    quantity INT,
    order_date TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES (
    'orc.compress'='ZLIB',
    'orc.bloom.filter.columns'='customer_id,product_name'
);
Enabling ACID Transaction Support
-- Enable Hive ACID transactions
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- Create ACID-enabled ORC table
CREATE TABLE sales_acid (
    order_id BIGINT,
    customer_id STRING,
    amount DECIMAL(10,2),
    order_date DATE
)
CLUSTERED BY (order_id) INTO 10 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
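With the transactional table in place, row-level changes are plain SQL; Hive writes them as ORC delta files that background compaction later merges (the values are illustrative):
UPDATE sales_acid SET amount = 120.00 WHERE order_id = 1001;
DELETE FROM sales_acid WHERE order_date < DATE '2023-01-01';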
7. Key Takeaways
- Apache ORC is a columnar storage format optimized specifically for the Hadoop ecosystem
- Provides excellent compression ratios and query performance
- Built-in rich statistics and indexing mechanisms
- Native ACID transaction support, suitable for enterprise data warehouses
- Deep integration with Hive ecosystem, mature and stable
- Excellent performance in batch ETL and OLAP analytical scenarios
8. FAQ
Q: What are the differences between ORC and Parquet?
A: Both are columnar formats, but they have different strengths:
- ORC: Better support in Hive ecosystem, built-in ACID transactions, higher compression ratios
- Parquet: Broader ecosystem, better Spark integration, stronger cross-platform compatibility
Q: Does ORC support nested data types?
A: Yes, ORC supports complex data types including:
- STRUCT (structures)
- ARRAY (arrays)
- MAP (mappings)
- UNION (union types)
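A sketch of Hive DDL using these types (column names are illustrative):
CREATE TABLE orders_nested (
    order_id BIGINT,
    items ARRAY<STRUCT<sku:STRING, qty:INT>>,
    attributes MAP<STRING, STRING>
)
STORED AS ORC;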
Q: How to choose ORC compression algorithms?
A: Recommendations based on scenarios:
- ZLIB: Balanced compression ratio and speed, default choice
- Snappy: Fast compression/decompression speed, medium compression ratio
- LZ4: Extremely fast decompression speed, suitable for frequently accessed data
- ZSTD: Best compression ratio, suitable for archived data
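The codec is a per-table property; note that changing it affects only files written afterwards, not data already on disk. A sketch:
-- Newly written files use ZSTD; existing files keep their original codec
ALTER TABLE sales_orc SET TBLPROPERTIES ('orc.compress'='ZSTD');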
Q: Are ORC files compatible across different versions?
A: ORC is backward compatible; newer versions can read older version files. However, it's recommended to:
- Maintain ORC version consistency
- Use schema evolution features for structural changes
- Test compatibility in production environments