Skip to content

Fix ReceivedChunks::spill() throwing on control messages#975

Open
Matt711 wants to merge 4 commits intorapidsai:mainfrom
Matt711:bug/recieved-chunks-spilling
Open

Fix ReceivedChunks::spill() throwing on control messages#975
Matt711 wants to merge 4 commits intorapidsai:mainfrom
Matt711:bug/recieved-chunks-spilling

Conversation

@Matt711
Copy link
Copy Markdown
Contributor

@Matt711 Matt711 commented Apr 17, 2026

ReceivedChunks::spill() iterated all chunks in the postbox and called chunk.data_memory_type() as the left operand of ||, so C++ evaluated it before the size == 0 guard could short-circuit. Control messages (finish tokens) have no data buffer by design (data_ == nullptr), so data_memory_type() hits RAPIDSMPF_EXPECTS(data_, "data buffer is not set") and fails.

The fix is to reorder the condition to check size == 0 and !is_data_buffer_set() first, so data_memory_type() is only called when there's actually a buffer to inspect.

More context: xref rapidsai/cudf#21750

Details

This originally came up when running TPC-DS Q67 with --executor streaming --runtime rapidsmpf --spill-to-pinned-memory --n-workers 1. The query was running under memory pressure (that's the point of --spill-to-pinned-memory). During the shuffle phase, partition_and_pack called br->reserve_device_memory_and_spill() internally, which triggered the spill manager to invoke all registered spill callbacks. This includes the one registered by ShufflerAsync. That callback called ReceivedChunks::spill(), which iterated over chunks in the postbox. By that point, finish tokens (control messages) had already arrived in the postbox, and calling data_memory_type() on one threw because data_ was null.

@Matt711 Matt711 requested a review from a team as a code owner April 17, 2026 02:15
@Matt711 Matt711 added bug Something isn't working non-breaking Introduces a non-breaking change labels Apr 17, 2026
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Matt711
Copy link
Copy Markdown
Contributor Author

Matt711 commented Apr 17, 2026

/ok to test bda7ee3

@Matt711 Matt711 force-pushed the bug/recieved-chunks-spilling branch from bda7ee3 to 214cc0f Compare April 17, 2026 02:21
@Matt711
Copy link
Copy Markdown
Contributor Author

Matt711 commented Apr 17, 2026

/ok to test 7606a43

Copy link
Copy Markdown
Member

@madsbk madsbk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Matt711, good catch!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working non-breaking Introduces a non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants