@eerhardt
Contributor

Ensure 8-byte alignment on each buffer in a RecordBatch as specified in https://arrow.apache.org/docs/format/Layout.html#requirements-goals-and-non-goals

> It is required to have all the contiguous memory buffers in an IPC payload aligned at 8-byte boundaries. In other words, each buffer must start at an aligned 8-byte offset. Additionally, each buffer should be padded to a multiple of 8 bytes.
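
As a minimal sketch of the arithmetic this requirement implies: the helper name `RoundUpToMultipleOf8` is taken from the review discussion, while the class name `AlignmentSketch` and the `PaddingFor` helper are hypothetical and only for illustration. This is not necessarily the code in the patch.

```csharp
using System;

public static class AlignmentSketch
{
    // Rounds a non-negative length up to the next multiple of 8.
    // Adding 7 and clearing the low three bits works because 8 is a power of two.
    public static long RoundUpToMultipleOf8(long length)
    {
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));
        return (length + 7) & ~7L;
    }

    // Number of padding bytes to append after a buffer of the given length
    // (hypothetical helper, not named in the discussion).
    public static int PaddingFor(long length)
        => (int)(RoundUpToMultipleOf8(length) - length);
}
```

For example, a 13-byte buffer rounds up to 16, so it is followed by 3 bytes of padding and the next buffer starts at a multiple of 8.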

/cc @pgovind @stephentoub @imback82

@wesm - If possible, can we also include this patch in the next release (0.14.1 or 0.15.0)? We hit this issue trying to update .NET for Apache Spark to the latest Arrow release - dotnet/spark#167.

public readonly ArrowBuffer DataBuffer;
public readonly int Offset;
public readonly int Length;
public readonly int Padding;

Contributor

Does this actually need to be stored in Buffer? Seems like the call site in WriteRecordBatchInternalAsync could just compute this value instead?

Contributor Author

It could be computed in WriteRecordBatchInternalAsync, but then the padding computation (i.e. the call to RoundUpToMultipleOf8) would need to live in both places.
The padding is needed when creating the Buffer because TotalLength has to be incremented to include it. That way the next Buffer's Offset is always aligned to 8 bytes.
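
A rough sketch of the offset bookkeeping described here, assuming a running total that plays the role of TotalLength. The class and method names (`BufferLayoutSketch`, `ComputeOffsets`) are hypothetical and not taken from the patch.

```csharp
using System;
using System.Collections.Generic;

public static class BufferLayoutSketch
{
    // Returns the starting offset of each buffer in the record batch body,
    // plus the total body length including padding.
    public static (long[] Offsets, long TotalLength) ComputeOffsets(IReadOnlyList<int> bufferLengths)
    {
        var offsets = new long[bufferLengths.Count];
        long totalLength = 0;

        for (int i = 0; i < bufferLengths.Count; i++)
        {
            // The running total is always a multiple of 8, so every offset is aligned.
            offsets[i] = totalLength;

            // Advance by the padded length, not the raw length.
            long paddedLength = (bufferLengths[i] + 7L) & ~7L;
            totalLength += paddedLength;
        }

        return (offsets, totalLength);
    }
}
```

With buffer lengths of 5, 16 and 3 bytes, the offsets come out as 0, 8 and 24, and the total body length is 32. Without including the padding in the running total, the second buffer would start at offset 5.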


Contributor

Yes, but that computation is cheap, and it looks like this increases the size of the buffer, which could be more expensive overall. Something to think about.

Contributor Author

I added a quick benchmark for ArrowStreamWriter, and noticed that I can remove ~10% of the allocated memory during Write by making this struct smaller - I could also remove the Length field. The running time is not affected. So I went forward with the change, thanks for the suggestion.
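
To illustrate the shape of that change, here is a hedged sketch in which the per-buffer record stores only the data and its offset, and the length and padding are derived at write time. `PaddedWriteSketch`, `BufferEntry` and `Write` are hypothetical names, `ReadOnlyMemory<byte>` stands in for `ArrowBuffer`, and a synchronous `Stream.Write` is used for brevity where the real writer is asynchronous.

```csharp
using System;
using System.IO;

public static class PaddedWriteSketch
{
    // Hypothetical stand-in for the writer's per-buffer record. Length and
    // padding are no longer stored; both can be derived from Data.
    public readonly struct BufferEntry
    {
        public readonly ReadOnlyMemory<byte> Data;
        public readonly int Offset;

        public BufferEntry(ReadOnlyMemory<byte> data, int offset)
        {
            Data = data;
            Offset = offset;
        }
    }

    // Shared zero bytes; at most 7 are ever needed for padding.
    private static readonly byte[] s_zeros = new byte[8];

    public static void Write(Stream stream, BufferEntry entry)
    {
        // Copying to an array keeps the sketch simple; a real writer would avoid it.
        byte[] bytes = entry.Data.ToArray();
        stream.Write(bytes, 0, bytes.Length);

        // Recompute the padding here instead of carrying it in the struct.
        int padding = (int)(((bytes.Length + 7L) & ~7L) - bytes.Length);
        stream.Write(s_zeros, 0, padding);
    }
}
```

The trade-off is the one raised in the review: the padding is recomputed with one add and one mask per buffer, in exchange for a smaller struct per buffer.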

Original proposal

| Method | BatchLength | ColumnSetCount | Mean | Error | StdDev | Gen 0/1k Op | Allocated Memory/Op |
|---|---|---|---|---|---|---|---|
| WriteBatch | 10000 | 10 | 832.5 us | 52.00 us | 151.7 us | - | 44.38 KB |
| WriteBatch | 10000 | 25 | 1,776.1 us | 61.54 us | 178.5 us | - | 121.09 KB |
| WriteBatch | 1000000 | 10 | 61,126.2 us | 1,162.39 us | 1,141.6 us | - | 44.38 KB |
| WriteBatch | 1000000 | 25 | 150,407.0 us | 2,938.05 us | 3,383.5 us | - | 121.09 KB |

With new changes

| Method | BatchLength | ColumnSetCount | Mean | Error | StdDev | Gen 0/1k Op | Allocated Memory/Op |
|---|---|---|---|---|---|---|---|
| WriteBatch | 10000 | 10 | 827.4 us | 47.40 us | 138.3 us | - | 40.41 KB |
| WriteBatch | 10000 | 25 | 1,758.5 us | 60.13 us | 176.3 us | - | 105.13 KB |
| WriteBatch | 1000000 | 10 | 61,222.3 us | 1,391.92 us | 1,429.4 us | - | 40.41 KB |
| WriteBatch | 1000000 | 25 | 149,516.4 us | 1,198.86 us | 936.0 us | - | 105.13 KB |
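
Working the "~10%" figure out from the Allocated Memory/Op columns above (values in KB per operation, for 10 and 25 column sets respectively):

$$
\frac{44.38 - 40.41}{44.38} \approx 8.9\%, \qquad \frac{121.09 - 105.13}{121.09} \approx 13.2\%
$$

The two savings bracket the quoted ~10%. In every row the difference in Mean between the two runs is smaller than the reported Error, which is consistent with the running time being unaffected.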

@eerhardt
Contributor Author

@ursabot --help

@ursabot

ursabot commented Jul 11, 2019

Usage: @ursabot [OPTIONS] COMMAND [ARGS]...

  Ursabot

Options:
  --help  Show this message and exit.

Commands:
  benchmark  Run the benchmark suite in comparison mode.
  build      Trigger all tests registered for this pull request.
  crossbow   Trigger crossbow builds for this pull request

@eerhardt
Contributor Author

@ursabot build

@wesm
Member

wesm commented Jul 11, 2019

> @wesm - If possible, can we also include this patch in the next release (0.14.1 or 0.15.0)? We hit this issue trying to update .NET for Apache Spark to the latest Arrow release - dotnet/spark#167.

Yes, sounds good

Recalculate padding instead of allocating more memory.
@eerhardt
Contributor Author

I'm not sure why the Rust leg is failing. I am pretty sure it isn't caused by my change.

@imback82 left a comment

LGTM. Thanks @eerhardt!

Member

@wesm left a comment

+1

@wesm closed this in 15fca3d on Jul 11, 2019
@eerhardt deleted the FixWriterPadding branch on July 11, 2019 22:06
wesm pushed a commit that referenced this pull request Jul 13, 2019
Author: Eric Erhardt <eric.erhardt@microsoft.com>

Closes #4851 from eerhardt/FixWriterPadding and squashes the following commits:

76807e9 <Eric Erhardt> PR feedback
7ecda78 <Eric Erhardt> Ensure 8-byte alignment on each buffer in a RecordBatch.