Skip to content

Conversation

@hendrikmakait
Copy link
Member

@hendrikmakait hendrikmakait commented Aug 23, 2023

Partially addresses #8015.

This PR reduces the final footprint, but it causes the memory spike worse than before during the unpack phase of the shuffle. This makes it more likely for workers to OOM.

cc @phofl

  • Tests added / passed
  • Passes pre-commit run --all-files

@hendrikmakait hendrikmakait requested a review from fjetter as a code owner August 23, 2023 17:00
Copy link
Collaborator

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to our benchmarks, this actually made it worse, e.g. more likely to fail.

@hendrikmakait
Copy link
Member Author

According to our benchmarks, this actually made it worse, e.g. more likely to fail.

Sorry, meant to put this on draft while working on the other half of this.

@hendrikmakait hendrikmakait marked this pull request as draft August 23, 2023 17:04
return pa.concat_tables(
(deserialize_table(buffer) for buffer in data), promote=True
)
).combine_chunks()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one isn’t necessary, only the one above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I'm done, it hopefully will :)

@hendrikmakait hendrikmakait changed the title Reduce memory footprint of P2P shuffle [DNM] Reduce memory footprint of P2P shuffle Aug 23, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Aug 23, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

     11 files   -        10       11 suites   - 10   2h 14m 40s ⏱️ - 8h 59m 25s
3 421 tests  -      358  3 161 ✔️  -      504  245 💤 +   137    2  -   4  13 🔥 +13 
7 407 runs   - 29 112  6 625 ✔️  - 28 072  750 💤  - 1 065  19 +12  13 🔥 +13 

For more details on these failures and errors, see this check.

Results for commit aa2711f. ± Comparison against base commit 03ea2e1.

This pull request removes 360 and adds 2 tests. Note that renamed tests count towards both.
distributed.dashboard.tests.test_scheduler_bokeh ‑ test_hardware
distributed.dashboard.tests.test_scheduler_bokeh ‑ test_hardware_endpoint
distributed.dashboard.tests.test_worker_bokeh ‑ test_CommunicatingStream
distributed.dashboard.tests.test_worker_bokeh ‑ test_basic[CommunicatingTimeSeries]
distributed.dashboard.tests.test_worker_bokeh ‑ test_basic[ExecutingTimeSeries]
distributed.dashboard.tests.test_worker_bokeh ‑ test_basic[StateTable]
distributed.dashboard.tests.test_worker_bokeh ‑ test_basic[SystemMonitor]
distributed.dashboard.tests.test_worker_bokeh ‑ test_counters
distributed.dashboard.tests.test_worker_bokeh ‑ test_prometheus
distributed.dashboard.tests.test_worker_bokeh ‑ test_routes
…
distributed.cli.tests.test_dask_worker.test_listen_address_ipv6[tcp:..[ ‑ 1]:---nanny]
distributed.cli.tests.test_dask_worker.test_listen_address_ipv6[tcp:..[ ‑ 1]:---no-nanny]
This pull request removes 13 skipped tests and adds 2 skipped tests. Note that renamed tests count towards both.
distributed.diagnostics.tests.test_memray
distributed.diagnostics.tests.test_nvml ‑ test_1_visible_devices
distributed.diagnostics.tests.test_nvml ‑ test_2_visible_devices[0,1]
distributed.diagnostics.tests.test_nvml ‑ test_2_visible_devices[1,0]
distributed.diagnostics.tests.test_nvml ‑ test_gpu_metrics
distributed.diagnostics.tests.test_nvml ‑ test_gpu_monitoring_range_query
distributed.diagnostics.tests.test_nvml ‑ test_gpu_monitoring_recent
distributed.diagnostics.tests.test_nvml ‑ test_has_cuda_context
distributed.diagnostics.tests.test_nvml ‑ test_one_time
distributed.shuffle.tests.test_graph ‑ test_raise_on_custom_objects
…
distributed.cli.tests.test_dask_worker.test_listen_address_ipv6[tcp:..[ ‑ 1]:---nanny]
distributed.cli.tests.test_dask_worker.test_listen_address_ipv6[tcp:..[ ‑ 1]:---no-nanny]
This pull request skips 148 tests.
distributed.cli.tests.test_dask_scheduler ‑ test_dashboard_allowlist
distributed.cli.tests.test_dask_scheduler ‑ test_signal_handling[15]
distributed.cli.tests.test_dask_scheduler ‑ test_signal_handling[2]
distributed.cli.tests.test_dask_scheduler ‑ test_signal_handling[Signals.SIGINT]
distributed.cli.tests.test_dask_scheduler ‑ test_signal_handling[Signals.SIGTERM]
distributed.cli.tests.test_dask_scheduler ‑ test_single_executable_deprecated
distributed.cli.tests.test_dask_ssh ‑ test_version_option
distributed.cli.tests.test_dask_worker ‑ test_listen_address_ipv6[tcp://:---nanny]
distributed.cli.tests.test_dask_worker ‑ test_listen_address_ipv6[tcp://:---no-nanny]
distributed.cli.tests.test_dask_worker ‑ test_listen_address_ipv6[tcp://[::1]:---nanny]
…

♻️ This comment has been updated with latest results.

@hendrikmakait
Copy link
Member Author

Superseded by #8157

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants