@eerhardt
Contributor

@eerhardt eerhardt commented Sep 4, 2019

Porting the fix for ARROW-6314 to the C# library.

/cc @chutchinson @pgovind

@wesm
Member

wesm commented Sep 4, 2019

Thanks @eerhardt!

@pgovind pgovind left a comment

LGTM

@wesm
Member

wesm commented Sep 11, 2019

Does this patch account for the EOS change under discussion on the mailing list?

@eerhardt
Contributor Author

Does this patch account for the EOS change under discussion on the mailing list?

Yes, partially. The reader treats a leading 4-byte length of 0 as end of stream and stops after those 4 bytes. It also stops if it encounters the continuation token 0xFFFFFFFF followed by a 4-byte 0.
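
As a rough illustration of the two end-of-stream forms described above, here is a minimal sketch. It is not the actual Arrow C# reader: the class and method names are hypothetical, it assumes the length prefix is little-endian (a point this thread leaves open), and it treats a stream that ends before a marker is read as end of stream.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

// Hypothetical helper, for illustration only.
public static class EosReaderSketch
{
    // 0xFFFFFFFF interpreted as a signed 32-bit integer.
    private const int ContinuationToken = -1;

    // Returns the length of the next message, or null at end of stream.
    public static int? ReadNextMessageLength(Stream stream)
    {
        byte[] buffer = new byte[4];

        // Assumption: a stream that ends here is treated as end of stream.
        if (!TryReadExactly(stream, buffer))
            return null;

        // Assumption: lengths are little-endian.
        int value = BinaryPrimitives.ReadInt32LittleEndian(buffer);

        if (value == ContinuationToken)
        {
            // Newer framing: the real length follows the continuation token.
            if (!TryReadExactly(stream, buffer))
                return null;
            value = BinaryPrimitives.ReadInt32LittleEndian(buffer);
        }

        // A length of 0 signals end of stream in both framings.
        if (value == 0)
            return null;
        return value;
    }

    // Reads until the buffer is full; returns false if the stream ends first.
    private static bool TryReadExactly(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0)
                return false;
            total += read;
        }
        return true;
    }
}
```

With this sketch, a stream holding only 00 00 00 00 stops after 4 bytes, and a stream holding FF FF FF FF 00 00 00 00 stops after 8 bytes.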

On the writer side, the C# API doesn't have a "write end of stream" method. It may now be important to add one, since the EOS marker can take two formats depending on the options.

I can add that new API, if you think it makes sense.
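
To make the two EOS formats concrete, here is a minimal sketch of what such a writer method could emit. This is not the WriteEndAsync implementation from this PR: the class name, method name, and the writeLegacyFormat flag are hypothetical stand-ins, and it assumes little-endian values.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

// Hypothetical helper, for illustration only.
public static class EosWriterSketch
{
    // Writes the end-of-stream marker.
    // writeLegacyFormat is a hypothetical flag standing in for the writer options.
    public static void WriteEndOfStream(Stream stream, bool writeLegacyFormat)
    {
        byte[] buffer = new byte[4];

        if (!writeLegacyFormat)
        {
            // Newer framing: continuation token 0xFFFFFFFF comes first.
            BinaryPrimitives.WriteInt32LittleEndian(buffer, -1);
            stream.Write(buffer, 0, 4);
        }

        // Both framings end with a 4-byte zero length.
        BinaryPrimitives.WriteInt32LittleEndian(buffer, 0);
        stream.Write(buffer, 0, 4);
    }
}
```

In the legacy case this writes 4 bytes (00 00 00 00); otherwise it writes 8 bytes (FF FF FF FF 00 00 00 00).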

@wesm
Member

wesm commented Sep 11, 2019

Yes, that sounds good to me!

@eerhardt
Contributor Author

@pgovind @chutchinson - FYI I implemented a new method on ArrowStreamWriter, WriteEndAsync(), which writes the EOS signal. I also removed ArrowFileWriter's WriteFooterAsync in favor of WriteEndAsync from the base class. This means anyone using ArrowFileWriter will see a small change: WriteFooterAsync => WriteEndAsync.

Please take another look.

@wesm wesm force-pushed the ARROW-6313-flatbuffer-alignment branch from 4f9b887 to 0352456 September 11, 2019 22:07
@eerhardt
Contributor Author

@wesm - can you explain what happened here? I didn't change the read_write_test.cc file, yet I am getting a merge conflict.

…ide backwards compatibility and "legacy" option to emit old message format
Remove WriteFooterAsync on ArrowFileWriter and instead use WriteEndAsync from the base class. This basically renames WriteFooterAsync to WriteEndAsync for the file writer.

Fix a bug in ArrowFileWriter - we now write the EOS signal before the footer, as specified in https://arrow.apache.org/docs/format/IPC.html.
@eerhardt
Contributor Author

@wesm - this is ready to be merged as soon as CI completes.

return;
}

messageLength = BitUtility.ReadInt32(lengthBuffer);
Contributor

This appears to perform an unaligned read from the specified buffer assuming a native byte ordering; shouldn't this be using BinaryPrimitives.ReadInt32LittleEndian?
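
To illustrate the distinction raised here, the sketch below contrasts a native-order read with an explicit little-endian read. The class and method names are hypothetical and this is not the BitUtility code under review; MemoryMarshal.Read is used only as a stand-in for an unaligned, host-order read.

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.InteropServices;

// Hypothetical helper, for illustration only.
public static class LengthReadSketch
{
    // Interprets the first 4 bytes in the host's native byte order.
    public static int ReadNative(ReadOnlySpan<byte> bytes)
    {
        return MemoryMarshal.Read<int>(bytes);
    }

    // Interprets the first 4 bytes as little-endian, regardless of host.
    public static int ReadLittleEndian(ReadOnlySpan<byte> bytes)
    {
        return BinaryPrimitives.ReadInt32LittleEndian(bytes);
    }
}
```

For the bytes 10 00 00 00, ReadLittleEndian returns 16 on any host, while ReadNative returns 16 only on a little-endian host.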

Contributor Author

This code was refactored from the original code, which you can see here:

// Get Length of record batch for message header.
lengthBuffer = Buffers.Rent(4);
bytesRead += await BaseStream.ReadAsync(lengthBuffer, 0, 4, cancellationToken);
var messageLength = BitConverter.ToInt32(lengthBuffer, 0);

Originally, we always had a byte[] and would call BitConverter.ToInt32. With the changes to support Memory and Span, I needed to make the same call with a Span instead of a byte[]. That overload exists in .NET, but it is not available in netstandard, so I copied the small amount of code involved into the BitUtility class.

https://source.dot.net/#System.Private.CoreLib/shared/System/BitConverter.cs,269

You can see that BitConverter.ToInt32(byte[]) performs the same operation:

return Unsafe.ReadUnaligned<int>(ref value[startIndex]);

From what I can tell, the C++ implementation does the same thing:

(master branch)

int32_t flatbuffer_size = *reinterpret_cast<const int32_t*>(buffer->data());

(ARROW-6313-flatbuffer-alignment branch)

inline typename std::enable_if<std::is_integral<T>::value, T>::type SafeLoadAs(
    const uint8_t* unaligned) {
  typename std::remove_const<T>::type ret;
  std::memcpy(&ret, unaligned, sizeof(T));
  return ret;

I was never sure about this, and the spec doesn't fully specify whether these length values are big-endian, little-endian, or machine dependent. That's why I've never changed this code and have left it doing what it has always done.

https://arrow.apache.org/docs/format/Layout.html#byte-order-endianness

The Arrow format is little endian by default. The Schema metadata has an endianness field indicating endianness of RecordBatches. Typically this is the endianness of the system where the RecordBatch was generated.

Having the endianness inside the schema doesn't help when you need to know the endianness of the schema length in order to read the schema itself.

I see we are always writing little-endian numbers for these lengths, so maybe changing it here can be justified that way.

Thoughts?

Contributor Author

Maybe since this issue has existed in this code since its inception, it would be best to open a JIRA issue for this.

https://issues.apache.org/jira/browse/ARROW-6553 - "[C#] Decide how to read message lengths - little-endian or machine dependent"

Contributor

@eerhardt I'll continue the discussion in that JIRA issue. I interpreted the "little-endian by default" section to mean that the IPC protocol is always little-endian, while array primitives have a byte order corresponding to the (optional) schema metadata value. If the protocol specification does not specify byte ordering or a mechanism for determining it, I would view that as an oversight; however, it could also just mean that the C++ code is presently non-compliant or does not support such endian-awareness.

Member

The C++ implementation is not big-endian compliant. Even finding environments to do big endian testing nowadays is a major challenge.


namespace Apache.Arrow.Ipc
{
public class IpcOptions
Contributor

How do you feel about naming this ArrowFileWriterOptions or ArrowWriterOptions, in consideration of future options that may or may not be IPC-specific?

Contributor Author

I was following the same naming/design as the Java and C++ implementations. But if you prefer a different name here for C#, I can change it to ArrowWriterOptions.

Contributor

The name seems a little narrow and disconnected to me, but I do understand there's value in replicating the name used in the other implementation(s) - it can assist in auditing, documentation, platform interoperability, etc.

Given that, I suppose we should leave it the way it is.

@eerhardt
Contributor Author

@chutchinson - are you OK with these changes for 0.15? We can change the length reading to little-endian with https://issues.apache.org/jira/browse/ARROW-6553, if necessary.

@wesm
Member

wesm commented Sep 12, 2019

Merging this in. There's still some time left to make more changes if needed. Thanks all for being conscientious about these changes.

@wesm wesm closed this Sep 12, 2019
@eerhardt eerhardt deleted the ARROW-6313-csharp branch September 12, 2019 23:39
wesm pushed a commit that referenced this pull request Sep 13, 2019
…ide backwards compatibility and "legacy" option to emit old message format

Porting the fix for ARROW-6314 to the C# library.

/cc @chutchinson @pgovind

Closes #5280 from eerhardt/ARROW-6313-csharp and squashes the following commits:

231e90c <Eric Erhardt> Implement WriteEndAsync on ArrowStreamWriter to write the EOS signal.
c494a4b <Eric Erhardt> ARROW-6314:  Implement IPC message format alignment changes, provide backwards compatibility and "legacy" option to emit old message format

Authored-by: Eric Erhardt <eric.erhardt@microsoft.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
pprudhvi pushed a commit to pprudhvi/arrow that referenced this pull request Sep 16, 2019