
Conversation


@fjetter fjetter commented Jun 16, 2023

This is currently just a working theory and I don't have a proper reproducer, yet.

However, I observed some suboptimal scheduling behavior when two shuffles were running back-to-back, e.g.

df1.merge(df2).groupby(...).agg(..., split_out=2, shuffle="p2p")

The suboptimal behavior is that the transfer tasks of the second shuffle are not necessarily prioritized. However, they basically act as a RAM sink and should have strict priority over other work if we want to schedule memory-optimally. A similar argument holds for the shuffle output tasks, which literally read data from disk into memory and therefore act as memory producers.
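A minimal sketch of the kind of workload I mean (a local distributed cluster and synthetic `dask.datasets.timeseries` data; the column names, date range, and join key are illustrative, not taken from the original benchmark):

```python
# Sketch of two back-to-back P2P shuffles: a merge followed by a
# groupby-aggregation with split_out > 1. Assumes a local cluster and
# synthetic data; not the original workload.
import dask
import dask.datasets
from distributed import Client

if __name__ == "__main__":
    client = Client()  # P2P shuffling requires the distributed scheduler

    df1 = dask.datasets.timeseries(start="2000-01-01", end="2000-01-08").reset_index()
    df2 = dask.datasets.timeseries(start="2000-01-01", end="2000-01-08").reset_index()

    # First P2P shuffle: the merge repartitions both inputs by the join key.
    merged = df1.merge(df2, on="timestamp", shuffle="p2p")

    # Second P2P shuffle: its transfer tasks consume the merge output
    # (freeing memory), while its output tasks read the shuffled partitions
    # back from disk into memory.
    result = merged.groupby("id_x").agg({"x_x": "mean"}, split_out=2, shuffle="p2p")

    result.compute()
```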

We were already able to track down a significant portion of the root task queuing impact to improper ordering (see dask/dask#9995 and #7526 (comment)), such that queuing de facto deprioritizes root tasks.
I believe that the shuffle tasks should be classified as root tasks in this scheduling paradigm, but teaching this to our heuristic seems cumbersome, if not impossible. Instead, we can simply (de-)prioritize the tasks accordingly.
I need to follow up with a decent measurement to solidify this theory. For context, I stumbled upon this scheduling behavior while playing with coiled/benchmarks#883 but couldn't get it to succeed yet for a couple of other reasons.
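For illustration only, a hedged sketch of what manual (de-)prioritization could look like from user code via `dask.annotate(priority=...)`. Whether the annotation survives graph optimization for dataframe collections depends on the dask version, and this is not the scheduler-side change discussed here; it only illustrates the idea of nudging the second shuffle's tasks ahead of other work:

```python
# Illustrative only: raise the priority of the second shuffle's tasks
# (including its transfer tasks, the "RAM sink") via an annotation that the
# distributed scheduler consults when ordering runnable tasks.
import dask
import dask.datasets
from distributed import Client

if __name__ == "__main__":
    client = Client()

    df1 = dask.datasets.timeseries(start="2000-01-01", end="2000-01-08").reset_index()
    df2 = dask.datasets.timeseries(start="2000-01-01", end="2000-01-08").reset_index()

    merged = df1.merge(df2, on="timestamp", shuffle="p2p")  # first shuffle

    # Higher priority values are scheduled earlier; the annotation may be
    # dropped by graph optimization in some dask versions.
    with dask.annotate(priority=10):
        result = merged.groupby("id_x").agg({"x_x": "mean"}, split_out=2, shuffle="p2p")

    result.compute()
```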

cc @hendrikmakait

@github-actions

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    20 files ±0     20 suites ±0    11h 20m 20s ⏱️ (-23m 8s)
 3 681 tests ±0:   3 572 passed ±0,    108 skipped ±0,  1 failed (+1)
35 600 runs  ±0:  33 835 passed (+2), 1 764 skipped (-1), 1 failed (±0)

For more details on these failures, see this check.

Results for commit 13d1011. Comparison against base commit 4a0c489.

