feat: add custom_metadata support to RecordBatch with IPC read/write#9445
rustyconover wants to merge 2 commits into apache:main

Conversation
…ustom_metadata

Pin arrow-rs to rustyconover/arrow-rs#feat/recordbatch-custom-metadata (apache/arrow-rs#9445) and rewrite vgi-rpc/src/wire.rs as a thin wrapper around arrow_ipc::reader::StreamReader / writer::StreamWriter.

Per-batch metadata now travels on RecordBatch directly via with_custom_metadata() / custom_metadata(); the Metadata alias becomes HashMap<String, String> and the ReadBatch wrapper is gone. relax_nullability flips with_skip_validation(true) on the inner reader, since upstream validates before our schema rewrap.

Also bundles in-progress conformance worker, http, and arrow_type changes that were already pending on the branch. Conformance: 723/723 across pipe/subprocess/http/unix/externalize.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alamb
left a comment
Thanks @rustyconover -- I see the custom_metadata field on the IPC messages, and it makes sense to expose that somehow.
I am less sure about adding a new field to RecordBatch -- mostly because I am not sure about the implications of doing so (though your point that an empty HashMap has no allocations is a good one).
I am mostly thinking of our experience in other libraries trying to handle custom metadata on Fields, where many kernels / processing steps don't preserve the metadata, and it has been quite tough.
I fear the same thing would happen to this field -- basically, that it would not be used by most libraries, but they would all pay the size cost on every RecordBatch.
I wonder if @tustvold @jhorstmann @viirya or @kylebarron have any thoughts on this matter (adding custom metadata to every RecordBatch).
Hi @alamb, I've patched arrow-go and arrow-js, and I have working patches for arrow-java as well. So it's mostly about connectivity for me.
Hi @alamb, thanks for your review. I think the CI passed. Is there more you'd like me to do, being a new contributor? This is going to be the primary user to start: https://github.com/Query-farm/vgi-rpc-rust
Thanks @rustyconover -- there isn't anything I think you need to do. What I think is needed next is some buy-in from other maintainers / stakeholders that changing
Which issue does this PR close?
Closes `custom_metadata` on `RecordBatch` (IPC Message field) #9444.

What changes are included in this PR?
Add per-batch `custom_metadata` to `RecordBatch`, matching the `custom_metadata` field on the IPC `Message` flatbuffer envelope. This allows attaching per-batch metadata separate from schema-level metadata, bringing parity with PyArrow's `write_batch(custom_metadata=...)` API (available since PyArrow v11.0.0).

Changes:

- Add a `custom_metadata: HashMap<String, String>` field to `RecordBatch` with `custom_metadata()`, `custom_metadata_mut()`, `with_custom_metadata()`, and `into_parts_with_custom_metadata()` accessors
- Propagate metadata through `flight_data_to_arrow_batch`
- Preserve metadata in `filter_record_batch` and `take_record_batch`
- Preserve metadata in `slice()`, `project()`, `normalize()`, `with_schema()`, and `remove_column()`

Are these changes tested?
Yes, there are tests in the PR.
Are there any user-facing changes?
There are no breaking changes.
Written with AI assistance; all changes reviewed by the author.