Improve memory footprint of P2P rechunking #7897

hendrikmakait · 2023-06-09T16:24:08Z

Preamble:

For the sake of this explanation, I will diverge from the common naming used in dask.array:

An n-dimensional Dask array like the one illustrated in the image below

is partitioned into many small n-dimensional NumPy arrays, each of which we will refer to as a partition. The partitions are created by chunking the array along each axis, where a single chunk determines the size of its corresponding partitions along this axis. In the image, we have 5 chunks along the horizontal axis, 4 chunks along the vertical axis and 20 partitions.

With $P :=$ number of partitions of the array, and $C_n :=$ number of chunks along axis n, and $N :=$ number of dimensions, it follows that

$$P = \prod_{n=1}^N C_n$$

Content

The big change in this PR is the intermediate result we store on the shuffle run to determine how to slice the input partitions in order to create the output partitions from the resulting shards. With this PR, we do not calculate for each of the input partitions, how we need to slice it into shards that belong to one output partition. Instead, we only store for each individual axis of the n-dimensional array how we need to split the chunks along that axis.

With $P_{old} :=$ number of old partitions, $P_{new} :=$ number of new partitions, this means with the previous algorithm we stored a result of size $\mathcal{O}(P_{old}* P_{new})$.

With $C_{old, n} :=$ number of old chunks along axis n, $C_{new, n} :=$ number of new chunks along axis n,

it follows that

$$ P_{old}* P_{new} = \prod_{n=1}^N C_{old, n} * C_{new, n} \in O((C_{old} * C_{new})^N)$$

where $C_{old} :=$ number of old chunks along all axes, $C_{new} :=$ number of new chunks along all axes

For the new approach, there are only $C_{old, n} + C_{new, n}$ splits per axis n. Hence, we only store a result of size

$$ \sum_{n=1}^N C_{old, n} + C_{new, n} = C_{old} + C_{new} \in \mathcal{O}(C_{old} + C_{new}) \in \mathcal{O}(P_{old} + P_{new})$$

Tests added / passed
Passes pre-commit run --all-files

cc @wence-

github-actions · 2023-06-09T18:11:44Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      20 files ±    0       20 suites ±0 13h 16m 46s ⏱️ + 1h 48m 58s
  3 677 tests -     2   3 567 ✔️ +    1   103 💤 -     5   7 ❌ +  2
35 743 runs +163 34 066 ✔️ +255 1 655 💤 - 109 22 ❌ +17

For more details on these failures, see this check.

Results for commit b93e602. ± Comparison against base commit 74a1bcd.

This pull request removes 9 and adds 7 tests. Note that renamed tests count towards both.

distributed.protocol.tests.test_numpy
distributed.shuffle.tests.test_rechunk
distributed.shuffle.tests.test_rechunk ‑ test_rechunk_slicing_1
distributed.shuffle.tests.test_rechunk ‑ test_rechunk_slicing_2
distributed.shuffle.tests.test_rechunk ‑ test_rechunk_slicing_chunks_with_nonzero
distributed.shuffle.tests.test_rechunk ‑ test_rechunk_slicing_chunks_with_zero
distributed.shuffle.tests.test_rechunk ‑ test_rechunk_slicing_nan
distributed.shuffle.tests.test_rechunk ‑ test_rechunk_slicing_nan_long
distributed.shuffle.tests.test_rechunk ‑ test_rechunk_slicing_nan_single

distributed.shuffle.tests.test_rechunk ‑ test_split_axes_1
distributed.shuffle.tests.test_rechunk ‑ test_split_axes_2
distributed.shuffle.tests.test_rechunk ‑ test_split_axes_nan
distributed.shuffle.tests.test_rechunk ‑ test_split_axes_nan_long
distributed.shuffle.tests.test_rechunk ‑ test_split_axes_nan_single
distributed.shuffle.tests.test_rechunk ‑ test_split_axes_with_nonzero
distributed.shuffle.tests.test_rechunk ‑ test_split_axes_with_zero

♻️ This comment has been updated with latest results.

wence-

I think a bit of exposition on what is going on would be helpful

wence- · 2023-06-12T10:01:23Z

distributed/shuffle/_rechunk.py

-    from dask.array.rechunk import intersect_chunks
-
-    ndim = len(old)
-    intersections = intersect_chunks(old, new)


OK, so previously we would make the full cartesian product of intersections up front.

Yes, previously we calculated the mapping information from each individual input chunk to each individual output chunk. This is results in the cartesian product of the chunks on the chunked axes.

wence- · 2023-06-12T10:02:27Z

distributed/shuffle/_rechunk.py

+    from dask.array.rechunk import old_to_new

-        shard_indices = product(*(range(dim) for dim in sub_shape))
+    _old_to_new = old_to_new(old, new)


Now we "just" do this mapping from old to new.

Now we calculate the mapping per axis, which we can piece together for each inividual N-dimensional input at runtime. So this scales much slower.

distributed/shuffle/_rechunk.py

wence- · 2023-06-12T10:07:44Z

distributed/shuffle/_rechunk.py

+    for axis_id, new_axis in enumerate(_old_to_new):
+        old_axis: SplitAxis = [[] for _ in old[axis_id]]
+        for new_chunk_id, new_chunk in enumerate(new_axis):


OK, so this is O(n^2) which makes sense because it's a transpose-like operation.

wence- · 2023-06-12T10:08:02Z

distributed/shuffle/_rechunk.py

+    for axis_id, new_axis in enumerate(_old_to_new):
+        old_axis: SplitAxis = [[] for _ in old[axis_id]]
+        for new_chunk_id, new_chunk in enumerate(new_axis):
+            for new_subchunk_id, (old_chunk_id, slice) in enumerate(new_chunk):


And each new_chunk is O(1) long? Or can it be O(n) long?

Each new chunk is $O(C_{old, n})$, that is, we could have a new chunk along axis n that contains all old chunks along that axis.

wence- · 2023-06-12T10:09:13Z

distributed/shuffle/_rechunk.py

-    for new_index, new_chunk in zip(new_indices, intersections):
-        sub_shape = [len({slice[dim][0] for slice in new_chunk}) for dim in range(ndim)]
+def split_axes(old: ChunkedAxes, new: ChunkedAxes) -> SplitAxes:
+    from dask.array.rechunk import old_to_new


Can you add a docstring please?

Done, does this help?

wence- · 2023-06-12T10:10:34Z

distributed/shuffle/_rechunk.py

+                continue
+            old_chunk.sort(key=lambda subchunk: subchunk[2].start)
+        axes.append(old_axis)
+    return axes


So this data is just O(n_old * n_new) in size, rather than O(n_old^2 * n_new^2) if I understand how the previous slicing datastructure was created?

See the description for an explanation of the memory complexity.

wence- · 2023-06-12T10:49:44Z

distributed/shuffle/_worker_extension.py

+            from itertools import product
+
+            shards = product(
+                *(axis[i] for axis, i in zip(self.split_axes, input_partition))


Is it obvious why the input partition is the same length as the split axes?

I think almost none about the rechunking code is obvious, so please raise wherever you're missing information. It took my way too long to understand what's going on in task-bask rechunking.

split_axes contains the splits across all N dimension/axes and input_partition is an N-dimensional index.

I've renamed input_partition to partition_id in case that makes things clearer.

wence- · 2023-06-12T10:53:34Z

distributed/shuffle/_worker_extension.py

+        rec_cat_arg[tuple(index)] = shard
    del data
    del file
    arrs = rec_cat_arg.tolist()


TODO: it feels naively like it is very wasteful to make objects of everything, convert to a nested list and then concatenate. Why is it not possible to just allocate the right size of thing up front and insert directly?

We only know the true size of the output chunk when we have loaded all its shards thanks to the chunks of size np.nan along some dimension. We could do some additional math to determine the output size once we've finished reading the file. This would allow us to avoid rec_cat_arg and we could insert the elements of shards directly. I'm not sure how much of a performance problem that really is though. I'd rather leave performance optimization for a follow-up.

hendrikmakait · 2023-06-12T11:41:12Z

I think a bit of exposition on what is going on would be helpful

I'll write up an in-depth explanation of the changes made here.

hendrikmakait · 2023-06-13T12:44:48Z

distributed/shuffle/_worker_extension.py

-                    (id, pickle.dumps((id.shard_index, data[nslice])))
+            from itertools import product
+
+            shards = product(


Since we only compute the splits per axis, we create the full cartesian product here again to determine each of the n-dimensional shards we have to slice from the input and send to the output.

hendrikmakait · 2023-06-13T13:13:39Z

@wence-: I've added a sketch of the changes to the memory complexity to the description.

hendrikmakait · 2023-06-13T14:50:22Z

distributed/shuffle/tests/test_rechunk.py

        pytest.param(
            da.ones(shape=(1000, 10), chunks=(5, 10)),
            (None, 5),
-            marks=pytest.mark.skip(reason="distributed#7757"),
        ),
        pytest.param(
            da.ones(shape=(1000, 10), chunks=(5, 10)),
            {1: 5},
-            marks=pytest.mark.skip(reason="distributed#7757"),
        ),
        pytest.param(
            da.ones(shape=(1000, 10), chunks=(5, 10)),
            (None, (5, 5)),
-            marks=pytest.mark.skip(reason="distributed#7757"),
        ),


It looks like these test cases still fail on CI, I'm investigating.

Locally, these take ~7s, roughly double the time of task-based shuffling. Given the shape of the rechunk it's not surprising that task-based performs better here. I think we may want to leave these cases out or adjust their size.

hendrikmakait · 2023-06-14T10:12:40Z

chunks are partitions and shards are the sliced partitions we are submitting?

Yes, if you have some ideas on clarifying naming, I'd appreciate it. We also have chunks as partitions/n-dimensional arrays and the chunks per axis (which are the information on the size of the chunks along this axis) .

EDIT: Now partitions are partitions and shards are sub-partitions of partitions created by slicing them. That is, a partition can be re-created by concatenating its shards.

hendrikmakait · 2023-06-14T11:23:59Z

I've run an A/B test and the results show significant improvements in average and peak memory:

as well as mild performance improvements on runtime for 2 tests:

fjetter · 2023-06-14T11:24:11Z

distributed/shuffle/_rechunk.py

-    slicing = defaultdict(list)
+SplitChunk: TypeAlias = list[Split]
+SplitAxis: TypeAlias = list[SplitChunk]
+SplitAxes: TypeAlias = list[SplitAxis]


I don't have any better proposal right now but the difference between Axis and Axes doesn't help readability

SplitAxes only pops up once so I guess this is not a big deal

fjetter · 2023-06-14T11:26:41Z

distributed/shuffle/_worker_extension.py

-                    (id, pickle.dumps((id.shard_index, data[nslice])))
+            from itertools import product
+
+            shards = product(


nit: I believe you are using splits and shards interchangeably. If so, I'd prefer sticking to one

Good catch, I've only introduced splits very recently to the chunks/chunks issue, I may have missed some cases.

Does ndsplits help convey the concept that they're not truly 1-d splits nor actual shards of data?

fjetter · 2023-06-14T11:31:53Z

distributed/shuffle/tests/test_rechunk.py

+            [Split(0, 0, slice(0, 20, None))],
+            [Split(0, 1, slice(0, 20, None))],
+            [Split(0, 2, slice(0, 18, None)), Split(1, 0, slice(18, 20, None))],
+            [Split(1, 1, slice(0, 2, None)), Split(2, 0, slice(2, 20, None))],
+            [Split(2, 1, slice(0, 2, None)), Split(3, 0, slice(2, 20, None))],


off-topic: something like {avg|max|median}(len(...)) might be interesting metrics for heuristics that try to distinguish P2P vs tasks for single stage rechunks (if there is actually a use case for this)

fjetter · 2023-06-14T11:55:43Z

I only skimmed the algorithm but from what I can tell everything's in order. I would appreciate a more thorough look from @wence-

Thanks for the thorough explanation!

Regarding benchmarks, results look great! I'd be interested to see what happens with test_rechunk_out_of_memory now. This is out of scope for this PR but likely worth looking at briefly.

wence- · 2023-06-14T15:19:45Z

I would appreciate a more thorough look from @wence-

I am out for the rest of the week and will look on Monday.

hendrikmakait · 2023-06-14T17:00:14Z

I'd be interested to see what happens with test_rechunk_out_of_memory now. This is out of scope for this PR but likely worth looking at briefly.

It finally runs with P2P. Its average (peak) memory is around 1/3 (1/2) of task-based shuffling, but it's still about 3x slower. I haven't done any profiling on this, so there's a myriad of reasons for the runtime difference.

https://github.com/coiled/benchmarks/actions/runs/5267999800

fjetter · 2023-06-15T12:20:14Z

It finally runs with P2P. Its average (peak) memory is around 1/3 (1/2) of task-based shuffling, but it's still about 3x slower. I haven't done any profiling on this, so there's a myriad of reasons for the runtime difference.

Thanks. We should look into this eventually but that's out of scope for this

wence-

Thanks Hendrik, I have a few questions about whether or not there is sparsity to exploit in the chunk mappings.

wence- · 2023-06-19T13:38:52Z

distributed/shuffle/_rechunk.py

+    #: Index of the new output chunk to which this split belongs.
+    chunk_index: int
+
+    #: Index of the split within the list of splits that are concatenated
+    #: to create the new chunk.
+    split_index: int
+
+    #: Slice of the input chunk.
+    slice: slice


OK, so the idea is that you're treating everything by just looking at the 1-D "cuts" of the axes.

So on an axis, each output chunk consists of some number of slices of input chunks.

An output chunk $C_{new,i}$ consists of $\bigoplus_{j=0}{}^N C_{old,j}[S^i_j]$ where $\bigoplus$ is concatenation, $N$ is the number of input chunks, and $S^i_j$ is the slice of the $j$ th input chunk that corresponds to the $i$ th output chunk.

How does this data structure record "sparseness" in the input chunks? AIUI, it is only a contiguous range of input chunks that can correspond to an output chunk, so most slices $S^i_j$ will be empty. We record the slice, but not which input chunk we're taking from.

This is recorded implicitly in SplitChunk. The SplitChunk at position j contains all slices belonging to input chunk j. The SplitChunk at position j in SplitAxis contains a Split for output chunk i IFF input chunk j contributes a non-empty slice to i.

(There is some special casing going on if chunk i is of length 0, in that case there exists one Split of some input chunk. That split contains an empty slice.)

I can walk you through some of the test cases, those should illustrate the data structure adequately.

wence- · 2023-06-19T13:57:53Z

distributed/shuffle/_rechunk.py

+    _old_to_new = old_to_new(old, new)
+
+    axes = []
+    for axis_index, new_axis in enumerate(_old_to_new):


So I think old_to_new produces "dense" output, in that every output chunk is hypothetically made up of some piece of every input chunk (so if there are $N$ input chunks and $P$ output chunks this is $N*P$ big). However, most of the time, the input piece is empty (indicated by an empty slice), so this data structure only needs to be $O(P)$ large I think.

This only produces data for overlaps, so the data structure should be $\mathcal{O}(N + P) = O(M)$ with M being the number of overlaps between input and output chunks.

Ah ok, so this is sparse.

wence- · 2023-06-19T13:58:22Z

distributed/shuffle/_rechunk.py

+    for axis_index, new_axis in enumerate(_old_to_new):
+        old_axis: SplitAxis = [[] for _ in old[axis_index]]
+        for new_chunk_index, new_chunk in enumerate(new_axis):
+            for split_index, (old_chunk_index, slice) in enumerate(new_chunk):


If we don't want to change old_to_new we could prune the slices here with:

if slice.start == slice.end: continue

wence- · 2023-06-19T14:00:16Z

distributed/shuffle/_rechunk.py

+                old_axis[old_chunk_index].append(
+                    Split(new_chunk_index, split_index, slice)
+                )


OK, so this is how maintain the relation between the input and output chunks. This is a "scatter-like" data structure. Each old_chunk_index scatters some slice of itself to every new_chunk_index.

wence- · 2023-06-19T14:09:39Z

distributed/shuffle/_rechunk.py

+                continue
+            old_chunk.sort(key=lambda subchunk: subchunk[2].start)
+        axes.append(old_axis)
+    return axes


wence- · 2023-06-19T14:12:20Z

distributed/shuffle/_rechunk.py

+                    Split(new_chunk_index, split_index, slice)
+                )
+        for old_chunk in old_axis:
+            old_chunk.sort(key=lambda split: split.slice.start)


Does old_to_new not produce slices in sorted order?

Good point, will have to check again.

wence- · 2023-06-19T14:37:23Z

distributed/shuffle/_worker_extension.py

+            ndsplits = product(
+                *(axis[i] for axis, i in zip(self.split_axes, partition_id))
+            )


So here, for each nD input piece, we construct the set of pD output chunks it scatters to.

Again, I think IIUC that this is a dense representation, but it is actually mostly sparse (since most of the input boxes will not contribute to most of the output boxes).

FWIW, the dimensionality does not change, so both input and output are n-dimensional. See other comments wrt. sparseness.

wence- · 2023-06-19T14:51:21Z

distributed/shuffle/_worker_extension.py

+            )
+
+            for ndsplit in ndsplits:
+                chunk_index, shard_index, ndslice = zip(*ndsplit)


So ndsplit is a list of Split tuples, and this transposes, so we get a tuple that identifies the output chunk, the "shard" (still not quite sure here), and an nD slice of the input data (corresponding to the input chunk).

But this seems to presume that the indexing of the output chunks is nD like the indexing of the input chunks. But I don't think that is true?

Why do you think it's not true?

Consider the case of a 2-dim array with size (10, 10) and a single input partition of size (10, 10). This input partition has the 2-dim index (0, 0). Any way of rechunking this to smaller pieces would also result in a 2-dim index.

I thought that rechunking also allowed reshaping, but that was wrong, I believe?

Indeed, rechunking does not allow reshaping. Admittedly, it looks like it's not documented anywhere but in some tests. We validate that chunks adhere to the shape here:
https://github.com/dask/dask/blob/499f4055da707fa76d06a2b79f408f124eee4723/dask/array/core.py#L3126-L3129

wence- · 2023-06-19T14:51:36Z

distributed/shuffle/_worker_extension.py

+
+            for ndsplit in ndsplits:
+                chunk_index, shard_index, ndslice = zip(*ndsplit)
+                id = ArrayRechunkShardID(chunk_index, shard_index)


Again, I suspect pruning empty slices will be beneficial here:

if any(s.start == s.end for s in ndslice): continue

wence- · 2023-06-19T15:05:08Z

distributed/shuffle/_rechunk.py

+    Returns
+    -------
+    SplitAxes
+        Splits along each axis that determine how to slice the input chunks to create


I think Splits here is a technical term (meaning the Split object)?

Terminology is hard:

By split, I mean a split object, which is essentially a slice with its index in the output. Do you have suggestions on how to make this clearer?

"list of :class:Splits ..." ? Which should linkify things?

We should explicitly mention the sparsity of the output here.

hendrikmakait · 2023-06-19T15:56:09Z

@wence-: Thanks for reviewing. Concerning your questions, a high-bandwidth conversation might help us sort these out. Let me know if you'd like to have one.

wence- · 2023-06-19T16:25:15Z

@wence-: Thanks for reviewing. Concerning your questions, a high-bandwidth conversation might help us sort these out. Let me know if you'd like to have one.

Thanks, I think I have...

wence-

Looks good. As discussed I'll open a followup PR adding some module-level documentation (which may result in internal renamings, but hopefully no substantive changes).

fjetter

Thank you @hendrikmakait and @wence- !

hendrikmakait added 3 commits June 9, 2023 17:19

Pre-compute per axis

beb6725

Compute on the fly

3651cdc

Naming

de8c91a

wence- reviewed Jun 12, 2023

View reviewed changes

Remove unnecessary code

2a34a07

hendrikmakait added 8 commits June 12, 2023 13:58

Merge branch 'main' into compute-on-the-fly

670f707

Merge branch 'main' into compute-on-the-fly

3a2fe72

Merge branch 'main' into compute-on-the-fly

a69250d

Add Split class

c84a70e

[skip-caching]

671b448

Docstring

90de20b

Minor

a69a536

Naming

c16ac75

hendrikmakait commented Jun 13, 2023

View reviewed changes

[skip-caching]

4c3488c

hendrikmakait commented Jun 13, 2023

View reviewed changes

fjetter reviewed Jun 14, 2023

View reviewed changes

hendrikmakait added 2 commits June 14, 2023 13:34

ND-prefix

cff7f68

minor

b93e602

hendrikmakait requested a review from wence- June 14, 2023 11:39

hendrikmakait marked this pull request as ready for review June 14, 2023 11:39

hendrikmakait added the needs review Needs review from a contributor. label Jun 14, 2023

hendrikmakait added the shuffle label Jun 14, 2023

wence- reviewed Jun 19, 2023

View reviewed changes

hendrikmakait mentioned this pull request Jun 19, 2023

Document that dask.array.rechunk expects the output shape to match the input shape dask/dask#10361

Closed

wence- approved these changes Jun 20, 2023

View reviewed changes

fjetter approved these changes Jun 20, 2023

View reviewed changes

fjetter merged commit cf97a7c into dask:main Jun 20, 2023

hendrikmakait mentioned this pull request Jun 20, 2023

Fix test_rechunk_with_{fully|partially}_unknown_dimension on CI #7934

Merged

2 tasks

Uh oh!

Improve memory footprint of P2P rechunking #7897

Improve memory footprint of P2P rechunking #7897

Uh oh!

Conversation

hendrikmakait commented Jun 9, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Jun 9, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Unit Test Results

Uh oh!

wence- left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hendrikmakait commented Jun 12, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hendrikmakait commented Jun 13, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hendrikmakait commented Jun 14, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hendrikmakait commented Jun 14, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fjetter commented Jun 14, 2023

Uh oh!

wence- commented Jun 14, 2023

hendrikmakait commented Jun 9, 2023 •

edited

Loading

github-actions bot commented Jun 9, 2023 •

edited

Loading

hendrikmakait commented Jun 14, 2023 •

edited

Loading

hendrikmakait commented Jun 14, 2023 •

edited

Loading