ARROW-6314: [C#] Implement IPC message format alignment changes, provide backwards compatibility and "legacy" option to emit old message format #5280
Thanks @eerhardt!
pgovind
left a comment
LGTM
Does this patch account for the EOS change under discussion on the mailing list?
Yes, partially. The reader handles the first 4 bytes being […]. On the writer side, the C# API doesn't have a "Write end of stream" function. But maybe now it is important to add that method, since the EOS can take 2 formats depending on the options. I can add that new API, if you think it makes sense.
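For readers following along, here is a minimal sketch of how a reader might tell the two message prefixes apart. It is based on the Arrow IPC specification rather than on this PR's code: the class and method names are hypothetical, and it assumes the length is little-endian, whereas the PR reads it in native byte order.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper, not part of Apache.Arrow.
public static class MessagePrefixSketch
{
    // Continuation marker of the aligned IPC format: 0xFFFFFFFF, i.e. -1 as a signed int.
    private const int ContinuationMarker = -1;

    // Returns the metadata length found at the start of `buffer`.
    // `prefixBytes` reports how many bytes the prefix occupied:
    // 4 for the legacy format, 8 for the aligned format.
    // A returned length of 0 signals end-of-stream in both formats.
    public static int ReadMessageLength(ReadOnlySpan<byte> buffer, out int prefixBytes)
    {
        // Assumption: lengths are little-endian on the wire.
        int first = BinaryPrimitives.ReadInt32LittleEndian(buffer);
        if (first == ContinuationMarker)
        {
            // Aligned format: the real length follows the marker.
            prefixBytes = 8;
            return BinaryPrimitives.ReadInt32LittleEndian(buffer.Slice(4));
        }

        // Legacy format: the first 4 bytes are the length itself.
        prefixBytes = 4;
        return first;
    }
}
```

A caller would read 4 bytes first, and fetch 4 more only when they equal the continuation marker; the sketch takes a single buffer to stay short.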
Yes, that sounds good to me!
@pgovind @chutchinson - FYI I implemented a new method on ArrowStreamWriter, WriteEndAsync, which writes the EOS signal. Please take another look.
4f9b887 to 0352456
@wesm - can you explain what happened here? I didn't change the |
…ide backwards compatibility and "legacy" option to emit old message format
Remove WriteFooterAsync on ArrowFileWriter and instead use WriteEndAsync from the base class. This basically renames WriteFooterAsync to WriteEndAsync for the file writer. Fix a bug in ArrowFileWriter - we now write the EOS signal before the footer, which is specified in https://arrow.apache.org/docs/format/IPC.html.
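To illustrate why the writer needs to know which format is in effect, here is a small sketch of the two end-of-stream encodings described in the Arrow IPC specification. The class, method, and the `writeLegacyFormat` parameter are hypothetical stand-ins, not the members this PR adds.

```csharp
using System;

// Hypothetical helper, not part of Apache.Arrow.
public static class EndOfStreamSketch
{
    // Returns the bytes that mark end-of-stream.
    // Legacy format: a 4-byte zero length.
    // Aligned format: the 0xFFFFFFFF continuation marker followed by a 4-byte zero length.
    public static byte[] GetEndOfStreamBytes(bool writeLegacyFormat)
    {
        return writeLegacyFormat
            ? new byte[] { 0, 0, 0, 0 }
            : new byte[] { 0xFF, 0xFF, 0xFF, 0xFF, 0, 0, 0, 0 };
    }
}
```

In a file writer these bytes would go out after the last record batch and before the footer, which is the ordering this commit fixes.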
8bee8e7 to 231e90c
@wesm - this is ready to be merged as soon as CI completes.
| return; |
| } |
|
| messageLength = BitUtility.ReadInt32(lengthBuffer); |
This appears to perform an unaligned read from the specified buffer assuming a native byte ordering; shouldn't this be using BinaryPrimitives.ReadInt32LittleEndian?
This code was refactored from the original code, which you can see here:
arrow/csharp/src/Apache.Arrow/Ipc/ArrowStreamReader.cs
Lines 62 to 66 in 044b418
| // Get Length of record batch for message header. | |
| lengthBuffer = Buffers.Rent(4); | |
| bytesRead += await BaseStream.ReadAsync(lengthBuffer, 0, 4, cancellationToken); | |
| var messageLength = BitConverter.ToInt32(lengthBuffer, 0); |
Originally, we always had a byte[] and would call BitConverter.ToInt32. However, with the changes to allow for Memory and Span, I needed to make the same call, only with a Span instead of byte[]. This API exists in .NET, but it is not available in netstandard. So I needed to copy the little bit of code out into the BitUtility class.
https://source.dot.net/#System.Private.CoreLib/shared/System/BitConverter.cs,269
You can see the BitConverter.ToInt32(byte[]) does the same operation.
return Unsafe.ReadUnaligned<int>(ref value[startIndex]);
From what I can tell, the C++ implementation does the same thing:
(master branch)
arrow/cpp/src/arrow/ipc/message.cc
Line 240 in c3a6878
| int32_t flatbuffer_size = *reinterpret_cast<const int32_t*>(buffer->data()); |
(ARROW-6313-flatbuffer-alignment branch)
arrow/cpp/src/arrow/util/ubsan.h
Lines 54 to 58 in 2d63975
| inline typename std::enable_if<std::is_integral<T>::value, T>::type SafeLoadAs( | |
| const uint8_t* unaligned) { | |
| typename std::remove_const<T>::type ret; | |
| std::memcpy(&ret, unaligned, sizeof(T)); | |
| return ret; |
I was never sure on this, and the spec doesn't 100% specify if these length numbers are big or little endian, or machine dependent. So that's why I've never changed this code, and left it doing what it has always been doing.
https://arrow.apache.org/docs/format/Layout.html#byte-order-endianness
The Arrow format is little endian by default. The Schema metadata has an endianness field indicating endianness of RecordBatches. Typically this is the endianness of the system where the RecordBatch was generated.
Having the endianness inside of the schema doesn't help when you need to know what endian the schema length is in, in order to read the schema itself.
I see we are always writing little-endian numbers for these lengths, so maybe changing it here can be justified that way.
Thoughts?
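To make the difference concrete, here is a short sketch contrasting the two ways of reading the length. The class and method names are hypothetical; `MemoryMarshal.Read<int>` stands in for the native-order read that `BitUtility.ReadInt32` performs, and `BinaryPrimitives.ReadInt32LittleEndian` is the alternative suggested in the review.

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of Apache.Arrow.
public static class LengthReadSketch
{
    // Unaligned read in the machine's native byte order.
    public static int ReadNative(ReadOnlySpan<byte> buffer) =>
        MemoryMarshal.Read<int>(buffer);

    // Unaligned read that always interprets the bytes as little-endian.
    public static int ReadLittleEndian(ReadOnlySpan<byte> buffer) =>
        BinaryPrimitives.ReadInt32LittleEndian(buffer);
}
```

For the bytes `10 00 00 00`, both methods return 16 on a little-endian machine. On a big-endian machine `ReadNative` would return 0x10000000 instead, so the two only diverge on hardware where, per this thread, the existing implementations are not being tested anyway.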
Maybe since this issue has existed in this code since its inception, it would be best to open a JIRA issue for this.
https://issues.apache.org/jira/browse/ARROW-6553 - "[C#] Decide how to read message lengths - little-endian or machine dependent"
@eerhardt I'll continue the discussion in that JIRA issue; I interpreted the "little-endian by default" section to mean that the IPC protocol is always little-endian, but that array primitives have a byte order corresponding to the (optional) schema metadata value. If the protocol specification does not specify byte ordering or a mechanism for determining byte ordering, I would think to view that as an oversight; however, it can also just mean the C++ code is presently non-compliant or does not support such endian-awareness.
The C++ implementation is not big-endian compliant. Even finding environments to do big endian testing nowadays is a major challenge.
| namespace Apache.Arrow.Ipc |
| { |
| public class IpcOptions |
How do you feel about naming this ArrowFileWriterOptions or ArrowWriterOptions? In consideration of future options that may or may not be IPC-specific.
I was following the same naming/design as the Java and C++ implementations. But if you prefer a different name here for C#, I can change it to ArrowWriterOptions.
The name seems a little narrow and disconnected to me, but I do understand there's value in replicating the name used in the other implementation(s) - it can assist in auditing, documentation, platform interoperability, etc.
Given that, I suppose we should leave it the way it is.
@chutchinson - are you OK with these changes for 0.15? We can change the length reading to little-endian with https://issues.apache.org/jira/browse/ARROW-6553, if necessary.
Merging this in. There's still some time left to make more changes if needed. Thanks all for being conscientious about these changes.
…ide backwards compatibility and "legacy" option to emit old message format Porting the fix for ARROW-6314 to the C# library. /cc @chutchinson @pgovind Closes #5280 from eerhardt/ARROW-6313-csharp and squashes the following commits: 231e90c <Eric Erhardt> Implement WriteEndAsync on ArrowStreamWriter to write the EOS signal. c494a4b <Eric Erhardt> ARROW-6314: Implement IPC message format alignment changes, provide backwards compatibility and "legacy" option to emit old message format Authored-by: Eric Erhardt <eric.erhardt@microsoft.com> Signed-off-by: Wes McKinney <wesm+git@apache.org>