Skip to content

Conversation

@amirgon
Copy link
Contributor

@amirgon amirgon commented Jul 2, 2024

When mode is UCS_THREAD_MODE_SERIALIZED, UCX crash due to mpool corruption.
This happens when buffer is deallocated on a different thread. In such case two threads access UCX memory pool simultaneously.

See discussion on UCX forum: openucx/ucx#9987

When mode is `UCS_THREAD_MODE_SERIALIZED`, UCX crash due to mpool corruption.   
This happens when buffer is deallocated on a different thread. In such case two threads access UCX memory pool simultaneously.

See discussion on UCX forum: openucx/ucx#9987
@lidavidm
Copy link
Member

lidavidm commented Jul 2, 2024

Thanks! However, this code is scheduled for deprecation/removal soon in favor of the Disassociated IPC proposal

@kou
Copy link
Member

kou commented Jul 2, 2024

This is not a MINOR change. See https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes for our MINOR definition.
If we need this change, could you open a new issue for this?

@amirgon
Copy link
Contributor Author

amirgon commented Jul 3, 2024

Thanks! However, this code is scheduled for deprecation/removal soon in favor of the Disassociated IPC proposal

@lidavidm - Could you please refer me to the "Disassociated IPC" proposal?
Does it mean UCX support is planned to be removed? What is the timeline for this?

If we need this change, could you open a new issue for this?

@kou - Sure, opened #43130

@amirgon amirgon changed the title MINOR: [ArrowFlight] Fix UCS thread mode [C++][ArrowFlight] Crash due to UCS thread mode Jul 3, 2024
@lidavidm
Copy link
Member

lidavidm commented Jul 3, 2024

@amirgon
Copy link
Contributor Author

amirgon commented Jul 3, 2024

https://arrow.apache.org/docs/dev/format/DissociatedIPC.html

Looks like the reference implementation for Disassociated IPC might suffer from the same issue:

https://github.com/apache/arrow-experiments/blob/188c4e5ff4bda08319d4520e380d736c36b9ee48/dissociated-ipc/ucx_client.cc#L55-L60

arrow::Result<std::unique_ptr<utils::Connection>> UcxClient::CreateConn() {
  ucp_worker_params_t worker_params;
  std::memset(&worker_params, 0, sizeof(worker_params));
  worker_params.field_mask =
      UCP_WORKER_PARAM_FIELD_THREAD_MODE | UCP_WORKER_PARAM_FIELD_FLAGS;
  worker_params.thread_mode = UCS_THREAD_MODE_SERIALIZED;

@lidavidm
Copy link
Member

lidavidm commented Jul 3, 2024

It was partially based on this experiment so I'm not surprised.

I think for now I'm OK merging this, but please be aware it will go away in the near future.

@amirgon
Copy link
Contributor Author

amirgon commented Jul 3, 2024

I think for now I'm OK merging this, but please be aware it will go away in the near future.

Sure, thanks.
Consider fixing it also on arrow-experiments.

@raulcd
Copy link
Member

raulcd commented Jul 3, 2024

As per the deprecation, see this email: https://lists.apache.org/thread/g89x2y6pvlq6gyf0d1jnxfl2onsrkyt8
I plan to work on a PR removing support soon.

@raulcd raulcd changed the title [C++][ArrowFlight] Crash due to UCS thread mode GH-43130: [C++][ArrowFlight] Crash due to UCS thread mode Jul 3, 2024
@github-actions
Copy link

github-actions bot commented Jul 3, 2024

⚠️ GitHub issue #43130 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Jul 5, 2024
@lidavidm lidavidm merged commit 5a28e18 into apache:main Jul 5, 2024
@lidavidm lidavidm removed the awaiting merge Awaiting merge label Jul 5, 2024
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 5a28e18.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 28 possible false positives for unstable benchmarks that are known to sometimes produce them.

amirgon added a commit to amirgon/arrow-experiments that referenced this pull request Jul 7, 2024
Fix reference example based on the same fix on Arrow.

See: apache/arrow#43120
ianmcook pushed a commit to apache/arrow-experiments that referenced this pull request Jul 8, 2024
Fix reference example based on the same fix on Arrow.

See: apache/arrow#43120
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants