@lintingbin lintingbin commented Jan 22, 2026

Summary

This PR adds an opt-in setting that forces Avro files to be rewritten during table optimization, enabling a high-throughput write strategy for data ingestion workloads.

Fixes #4057

Motivation

The Performance Trade-off

Different file formats have different performance characteristics:

  • Avro (row-based):

    • ✅ Excellent write performance and high throughput
    • ✅ Fast data ingestion
    • ❌ Less efficient for analytical queries
  • Parquet/ORC (columnar):

    • ✅ Excellent read performance for analytics
    • ✅ Better compression
    • ❌ Slower writes compared to Avro

The Solution

This PR enables a best-of-both-worlds strategy:

  1. Fast Ingestion: Write data in Avro format for maximum throughput
  2. Automatic Conversion: The optimization process rewrites Avro files to the table's default format (Parquet/ORC)
  3. Optimal Read Performance: All data eventually becomes available in columnar format for efficient queries

This is particularly valuable for:

  • High-throughput streaming ingestion pipelines
  • Real-time data collection scenarios
  • Systems that need to optimize both write and read performance

Changes

This PR includes the following changes:

1. Configuration Support

TableProperties.java & TableConfigurations.java

  • Added the SELF_OPTIMIZING_REWRITE_ALL_AVRO property (key: self-optimizing.rewrite-all-avro, default: false)
  • Added validation logic that ignores the setting when the table's default format is already Avro
  • Integrated the property into OptimizingConfig parsing
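
The validation rule above can be sketched roughly as follows. This is a minimal illustration, not the code in TableConfigurations.java: the class and method names are hypothetical, and it assumes the table's default format is read from Iceberg's `write.format.default` property (which falls back to `parquet`).

```java
import java.util.Map;

/** Hypothetical, simplified stand-in for the property handling described above. */
final class RewriteAllAvroConfigSketch {
  // Key as given in this PR's commit message; the feature is opt-in.
  static final String SELF_OPTIMIZING_REWRITE_ALL_AVRO = "self-optimizing.rewrite-all-avro";
  static final boolean SELF_OPTIMIZING_REWRITE_ALL_AVRO_DEFAULT = false;

  // Assumption: the default format comes from Iceberg's "write.format.default".
  static final String DEFAULT_FILE_FORMAT = "write.format.default";
  static final String DEFAULT_FILE_FORMAT_DEFAULT = "parquet";

  private RewriteAllAvroConfigSketch() {}

  /** Returns the requested value, unless the table's default format is Avro. */
  static boolean effectiveRewriteAllAvro(Map<String, String> tableProperties) {
    String raw = tableProperties.get(SELF_OPTIMIZING_REWRITE_ALL_AVRO);
    boolean requested =
        raw == null ? SELF_OPTIMIZING_REWRITE_ALL_AVRO_DEFAULT : Boolean.parseBoolean(raw.trim());

    String defaultFormat =
        tableProperties.getOrDefault(DEFAULT_FILE_FORMAT, DEFAULT_FILE_FORMAT_DEFAULT);
    boolean defaultIsAvro = "avro".equalsIgnoreCase(defaultFormat.trim());

    // Rewriting Avro into Avro would gain nothing, so the setting is ignored.
    return requested && !defaultIsAvro;
  }
}
```

With this sketch, a table that sets only `self-optimizing.rewrite-all-avro=true` gets the feature, while a table that also sets `write.format.default=avro` does not.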

2. Core Logic

CommonPartitionEvaluator.java

  • Added needRewriteAvroFile flag to track Avro file presence
  • Updated addFragmentFile() to detect Avro files
  • Modified fileShouldFullOptimizing() to always rewrite Avro files when the feature is enabled
  • Updated fileShouldRewrite() to prioritize Avro files for rewriting
  • Enhanced isMinorNecessary() to trigger optimization when Avro files are present
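
To illustrate how these pieces might fit together, here is a heavily simplified model of the evaluator. Only needRewriteAvroFile, fileShouldRewrite() and isMinorNecessary() are names from this PR; the FileInfo type, the addFile() method, the size threshold, and the file-count trigger are assumptions made for the sketch and do not reflect the real CommonPartitionEvaluator.

```java
/** Hypothetical, heavily simplified model of the evaluator changes listed above. */
final class AvroAwareEvaluatorSketch {
  enum Format { AVRO, PARQUET, ORC }

  /** Minimal stand-in for a data file: only the fields this sketch needs. */
  static final class FileInfo {
    final Format format;
    final long sizeBytes;

    FileInfo(Format format, long sizeBytes) {
      this.format = format;
      this.sizeBytes = sizeBytes;
    }
  }

  private final boolean rewriteAllAvro;
  private final long fragmentSizeBytes;  // assumption: smaller files count as fragments
  private final int minorLeastFileCount; // assumption: fragment count that triggers minor optimizing

  private boolean needRewriteAvroFile = false;
  private int fragmentFileCount = 0;

  AvroAwareEvaluatorSketch(boolean rewriteAllAvro, long fragmentSizeBytes, int minorLeastFileCount) {
    this.rewriteAllAvro = rewriteAllAvro;
    this.fragmentSizeBytes = fragmentSizeBytes;
    this.minorLeastFileCount = minorLeastFileCount;
  }

  private boolean isForcedAvroRewrite(FileInfo file) {
    return rewriteAllAvro && file.format == Format.AVRO;
  }

  /** Records a file and remembers whether any Avro file must be rewritten. */
  void addFile(FileInfo file) {
    if (isForcedAvroRewrite(file)) {
      needRewriteAvroFile = true;
    }
    if (file.sizeBytes < fragmentSizeBytes) {
      fragmentFileCount++;
    }
  }

  /** The Avro check comes first, so file size never exempts an Avro file. */
  boolean fileShouldRewrite(FileInfo file) {
    if (isForcedAvroRewrite(file)) {
      return true;
    }
    return file.sizeBytes < fragmentSizeBytes;
  }

  /** A single pending Avro file is enough to make optimization necessary. */
  boolean isMinorNecessary() {
    return needRewriteAvroFile || fragmentFileCount >= minorLeastFileCount;
  }
}
```

The point of the flag is that one large Avro file, which size-based rules alone would leave untouched, still triggers a rewrite when the feature is on.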

IcebergPartitionPlan.java

  • Updated task validation logic to prevent skipping single Avro file optimization
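
A rough sketch of what such a validation rule could look like is shown below. It assumes that a task covering a single data file with no delete files is normally dropped because it would only copy the file; that assumption, and all names in the block, are illustrative rather than taken from IcebergPartitionPlan.java.

```java
import java.util.List;

/** Hypothetical sketch of a task-validity check that keeps single-Avro-file tasks. */
final class SingleFileTaskSketch {
  enum Format { AVRO, PARQUET, ORC }

  /** Minimal stand-in for a data file selected for rewriting. */
  static final class RewriteCandidate {
    final Format format;
    final int deleteFileCount;

    RewriteCandidate(Format format, int deleteFileCount) {
      this.format = format;
      this.deleteFileCount = deleteFileCount;
    }
  }

  private SingleFileTaskSketch() {}

  static boolean isTaskWorthRunning(List<RewriteCandidate> files, boolean rewriteAllAvro) {
    if (files.isEmpty()) {
      return false;
    }
    if (files.size() > 1) {
      return true; // merging several files is always useful in this sketch
    }
    RewriteCandidate only = files.get(0);
    if (only.deleteFileCount > 0) {
      return true; // rewriting applies the deletes
    }
    // Assumed default: a lone file without deletes is skipped,
    // unless it is an Avro file that must be converted.
    return rewriteAllAvro && only.format == Format.AVRO;
  }
}
```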

ContentFiles.java

  • Added isAvroFile() utility method to identify Avro format files
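
A format check of this kind is presumably a one-line comparison on the file's format. The sketch below uses small stand-in types so that it compiles on its own; in Iceberg the corresponding pieces are `ContentFile.format()` and `FileFormat.AVRO`, but the exact signature of the method added in ContentFiles.java is not shown in this PR description.

```java
/** Hypothetical sketch of an Avro check, with stand-ins for the Iceberg types. */
final class ContentFilesSketch {
  /** Stand-in for a subset of Iceberg's FileFormat enum. */
  enum FileFormat { AVRO, PARQUET, ORC }

  /** Stand-in for the one ContentFile method this check needs. */
  interface ContentFile {
    FileFormat format();
  }

  private ContentFilesSketch() {}

  /** Returns true only for files whose recorded format is Avro. */
  static boolean isAvroFile(ContentFile file) {
    return file != null && file.format() == FileFormat.AVRO;
  }
}
```

Comparing the recorded format, rather than inspecting the file name, avoids misclassifying files whose paths lack an `.avro` extension.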

OptimizingConfig.java

  • Added rewriteAllAvro configuration field with getters/setters

3. Documentation

docs/user-guides/configurations.md

  • Added comprehensive documentation for the new configuration option
  • Explained the use case: high-throughput writes with Avro, optimized reads with Parquet/ORC

4. Tests

MixedTablePlanTestBase.java

  • Added appendAvroDataFile() helper method for test data generation

TestIcebergPartitionPlan.java & TestUnkeyedPartitionPlan.java

  • Added comprehensive test coverage for Avro file rewriting
  • Tests cover fragment files, undersized segment files, and target-size-reached files
  • Validates behavior when feature is enabled/disabled
  • Ensures proper handling when the table's default format is Avro

Testing

  • ✅ Verified that Avro files are correctly identified
  • ✅ Confirmed that optimization is triggered when Avro files are present and feature is enabled
  • ✅ Tested that Avro files are always included in rewrite operations
  • ✅ Validated that the feature is properly ignored when the table's default format is Avro
  • ✅ Added unit tests for all scenarios (fragment, undersized, target-size files)

Checklist

  • Code changes are complete
  • Changes maintain backward compatibility (feature is opt-in, default: false)
  • Code follows project conventions
  • Documentation updated
  • Comprehensive test coverage added

This commit adds the ability to rewrite all Avro files during table optimization.
When enabled via self-optimizing.rewrite-all-avro=true, all Avro format files
will be rewritten to the default file format (Parquet/ORC) during optimization.

This is particularly useful for high-throughput write scenarios where Avro format
provides better write performance, while this feature ensures read performance
is maintained by converting files to columnar formats.
@lintingbin force-pushed the feature/force-rewrite-avro-files branch from 9b3ccab to d5b92f0 on January 22, 2026
@github-actions bot added the type:docs, module:ams-server, and module:common labels on Jan 22, 2026
