Support per-batch custom_metadata on RecordBatch (IPC Message field) #9444

@rustyconover

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The Arrow IPC format supports a custom_metadata field on the Message flatbuffer envelope (Message.fbs), allowing per-batch metadata separate from schema-level metadata. Currently, the Rust RecordBatch struct has no custom_metadata field and the IPC reader/writer ignore it.

PyArrow has supported this since v11.0.0 via write_batch(batch, custom_metadata=...) and read_next_batch_with_custom_metadata(). This means IPC files written by PyArrow with per-batch metadata lose that metadata when read by arrow-rs.

Describe the solution you'd like

  1. Add a custom_metadata: HashMap<String, String> field to RecordBatch with accessor methods (custom_metadata(), custom_metadata_mut(), with_custom_metadata(), into_parts_with_custom_metadata())
  2. IPC writer: serialize custom_metadata to the Message flatbuffer when writing record batches
  3. IPC reader: extract custom_metadata from the Message at all reader call sites (FileDecoder, StreamReader, StreamDecoder)
  4. arrow-flight: extract and propagate custom_metadata in flight_data_to_arrow_batch
  5. arrow-select: preserve custom_metadata through filter_record_batch and take_record_batch
  6. Preserve metadata through slice(), project(), normalize(), with_schema(), and remove_column()
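
To make the proposed surface concrete, here is a minimal sketch of the accessor set from item 1 and the slice() preservation from item 6. It uses a hypothetical stand-in type `Batch` (a single `Vec<i64>` column) rather than the real `RecordBatch`, so the field layout, the `into_parts()` return type, and the copying `slice()` are assumptions made only for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `RecordBatch`: one i64 column plus the
/// proposed per-batch metadata. This is NOT the arrow-rs type.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    values: Vec<i64>,
    custom_metadata: HashMap<String, String>,
}

impl Batch {
    /// New batches start with empty metadata.
    pub fn new(values: Vec<i64>) -> Self {
        Self { values, custom_metadata: HashMap::new() }
    }

    pub fn num_rows(&self) -> usize {
        self.values.len()
    }

    pub fn custom_metadata(&self) -> &HashMap<String, String> {
        &self.custom_metadata
    }

    pub fn custom_metadata_mut(&mut self) -> &mut HashMap<String, String> {
        &mut self.custom_metadata
    }

    /// Builder-style setter that replaces the metadata wholesale.
    pub fn with_custom_metadata(mut self, metadata: HashMap<String, String>) -> Self {
        self.custom_metadata = metadata;
        self
    }

    /// Existing-style decomposition: signature unchanged, metadata dropped.
    pub fn into_parts(self) -> Vec<i64> {
        self.values
    }

    /// New decomposition that also hands back the metadata.
    pub fn into_parts_with_custom_metadata(self) -> (Vec<i64>, HashMap<String, String>) {
        (self.values, self.custom_metadata)
    }

    /// Single-batch transformations carry the metadata along.
    /// Panics if `offset + length` exceeds the number of rows.
    pub fn slice(&self, offset: usize, length: usize) -> Self {
        Self {
            values: self.values[offset..offset + length].to_vec(),
            custom_metadata: self.custom_metadata.clone(),
        }
    }
}
```

With this shape, `Batch::new(vec![1, 2, 3, 4]).with_custom_metadata(md).slice(1, 2)` yields a two-row batch whose `custom_metadata()` equals `md`, while callers of the old `into_parts()` are unaffected.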

Describe alternatives you've considered

  • Storing per-batch metadata in schema-level metadata with a naming convention — this conflates two levels of metadata and doesn't match the IPC format's intent.
  • An Option<HashMap<String, String>> instead of HashMap<String, String>. HashMap::new() is zero-allocation, so the overhead is minimal, and Option complicates every accessor for little gain.

Additional context

  • HashMap::new() does not heap-allocate, so there is no performance concern for the default (empty metadata) case.
  • The existing into_parts() signature is unchanged for backward compatibility; a new into_parts_with_custom_metadata() is added.
  • Multi-batch merge operations (concat_batches, interleave_record_batch, BatchCoalescer) intentionally do not propagate per-batch metadata since the semantics are ambiguous when merging batches with different metadata.
  • Reuses existing metadata_to_fb (convert.rs) for writing and the KV extraction pattern for reading.
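
Two of the points above can be checked or sketched with the standard library alone. The first function shows that a fresh `HashMap` reports a capacity of 0, which is how the standard library documents that nothing has been allocated yet. The second is a rough sketch of a key/value extraction loop for the reader side; `KeyValue` is a hypothetical stand-in for the flatbuffer entry type, and skipping entries with a missing key or value is an assumption, not a statement about what arrow-rs does.

```rust
use std::collections::HashMap;

/// Capacity of a freshly constructed, empty metadata map.
/// `HashMap::new()` starts at capacity 0 and allocates only on first insert.
pub fn empty_metadata_capacity() -> usize {
    let metadata: HashMap<String, String> = HashMap::new();
    metadata.capacity()
}

/// Hypothetical stand-in for a flatbuffer key/value entry, where both
/// fields are optional on the wire.
pub struct KeyValue {
    pub key: Option<String>,
    pub value: Option<String>,
}

/// Collect entries into a map, skipping any entry that lacks a key or a
/// value (assumed policy for this sketch).
pub fn extract_custom_metadata(entries: &[KeyValue]) -> HashMap<String, String> {
    let mut metadata = HashMap::new();
    for kv in entries {
        if let (Some(k), Some(v)) = (&kv.key, &kv.value) {
            metadata.insert(k.clone(), v.clone());
        }
    }
    metadata
}
```

An empty entry list returns an empty map, which matches the default state of a batch that carries no per-batch metadata.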