Description
Currently, the optimizer may skip rewriting Avro-format files even when optimization is triggered. This can leave Avro files in the table and degrade query performance.
Problem
Avro files have different characteristics compared to columnar formats (like Parquet or ORC):
- Write Performance: Avro format provides significantly better write performance and higher throughput compared to columnar formats, making it ideal for high-speed data ingestion scenarios
- Read Performance: Avro's row-based layout is less efficient for analytical queries than columnar formats
- Format Consistency: Mixed file formats in a table can complicate maintenance and optimization
Proposed Solution
Add a configurable option self-optimizing.rewrite-all-avro to force a rewrite of all Avro files during optimization. This enables a write-optimized ingestion strategy:
- Use Avro format for fast, high-throughput data ingestion
- Automatically convert Avro files to Parquet/ORC during optimization for better read performance
- Maintain optimal performance for both write and read workloads
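As a rough illustration of how the proposed option might be read from table properties, here is a minimal sketch. Only the property name self-optimizing.rewrite-all-avro comes from this issue; the class name, the helper method, and the default value of false are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not the project's actual API.
public class RewriteAllAvroOption {
    // Property name taken from the proposal in this issue.
    static final String REWRITE_ALL_AVRO = "self-optimizing.rewrite-all-avro";
    // Assumed default: the issue does not state one; "false" keeps today's behavior.
    static final boolean REWRITE_ALL_AVRO_DEFAULT = false;

    // Returns true only when the property is explicitly set to "true" (case-insensitive).
    static boolean isEnabled(Map<String, String> tableProperties) {
        String raw = tableProperties.get(REWRITE_ALL_AVRO);
        return raw == null ? REWRITE_ALL_AVRO_DEFAULT : Boolean.parseBoolean(raw.trim());
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put(REWRITE_ALL_AVRO, "true");
        System.out.println(isEnabled(props));
    }
}
```

Defaulting to false would keep existing tables unaffected until the option is switched on explicitly.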
Implementation Details
The changes include:
- Add a needRewriteAvroFile flag in CommonPartitionEvaluator to track if any Avro files exist
- Check file format using ContentFiles.isAvroFile() method
- Always mark Avro files for rewriting when the feature is enabled
- Update partition evaluation so that the presence of Avro files alone makes optimization necessary
- Add the configuration property with validation (it is ignored when the table's default format is Avro)
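The steps above could look roughly like the following simplified sketch. It is not the actual CommonPartitionEvaluator: the class PartitionEvaluatorSketch, the DataFileInfo type, the local FileFormat enum, and the method names addFile and isNecessary are stand-ins invented for illustration. Only the needRewriteAvroFile flag and the "ignored when the default format is Avro" rule are taken from the list.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the proposed evaluation logic.
public class PartitionEvaluatorSketch {
    enum FileFormat { AVRO, PARQUET, ORC }

    // Stand-in for a data file; carries only what the sketch needs.
    static final class DataFileInfo {
        final String path;
        final FileFormat format;
        DataFileInfo(String path, FileFormat format) {
            this.path = path;
            this.format = format;
        }
    }

    private final boolean rewriteAllAvro;
    private boolean needRewriteAvroFile = false; // set once any Avro file is seen
    private final List<DataFileInfo> rewriteFiles = new ArrayList<>();

    PartitionEvaluatorSketch(boolean optionEnabled, FileFormat tableDefaultFormat) {
        // Validation: the option is ignored when the table's default format is Avro,
        // since rewriting Avro into Avro would not change the format.
        this.rewriteAllAvro = optionEnabled && tableDefaultFormat != FileFormat.AVRO;
    }

    // Stand-in for the format check named in the list above.
    static boolean isAvroFile(DataFileInfo file) {
        return file.format == FileFormat.AVRO;
    }

    void addFile(DataFileInfo file) {
        if (rewriteAllAvro && isAvroFile(file)) {
            needRewriteAvroFile = true;
            rewriteFiles.add(file); // always mark Avro files for rewriting
        }
        // Size- and fragment-based criteria of the real evaluator are omitted here.
    }

    // In this sketch, Avro file presence alone makes optimization necessary.
    boolean isNecessary() {
        return needRewriteAvroFile;
    }

    List<DataFileInfo> rewriteFiles() {
        return rewriteFiles;
    }

    public static void main(String[] args) {
        PartitionEvaluatorSketch evaluator =
            new PartitionEvaluatorSketch(true, FileFormat.PARQUET);
        evaluator.addFile(new DataFileInfo("data-0001.avro", FileFormat.AVRO));
        evaluator.addFile(new DataFileInfo("data-0002.parquet", FileFormat.PARQUET));
        System.out.println(evaluator.isNecessary() + " " + evaluator.rewriteFiles().size());
    }
}
```

Resolving the option once in the constructor keeps the per-file check cheap and puts the default-format validation in a single place.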
Benefits
- High-throughput ingestion: Leverage Avro's superior write performance for data ingestion
- Optimal read performance: Ensure all data is eventually converted to columnar format for efficient queries
- Best of both worlds: Maximize both write and read performance through format conversion
- Flexible configuration: Enable/disable based on workload characteristics
- Table health: Maintains consistency in file formats across the table