
Conversation


@hendrikmakait hendrikmakait commented Sep 4, 2023

Closes #8015
Supersedes #8128
Blocked by dask/dask#10493

  • Tests added / passed
  • Passes pre-commit run --all-files

Comment on lines 87 to 95
with pa.OSFile(str(path), mode="rb") as f:
    size = f.seek(0, whence=2)
    f.seek(0)
    while f.tell() < size:
        sr = pa.RecordBatchStreamReader(f)
        shard = sr.read_all()
        arrs = [pa.concat_arrays(column.chunks) for column in shard.columns]
        shard = pa.table(data=arrs, schema=schema)
        shards.append(shard)
@hendrikmakait (Member Author):

By interleaving disk reads and deserialization, we reduced the size of the individual buffers that get created.

while f.tell() < size:
    sr = pa.RecordBatchStreamReader(f)
    shard = sr.read_all()
    arrs = [pa.concat_arrays(column.chunks) for column in shard.columns]
@hendrikmakait (Member Author):

From what I understand, the RecordBatchStreamReader creates one buffer per record batch. On main, this is a problem when we convert the pa.Table consisting of all those batches into a pd.DataFrame: the conversion frees buffers on a per-column basis, which effectively means that none of the buffers from any record batch are freed until we have converted the last column. To avoid this, we force a copy of each column with pa.concat_arrays directly after reading it. This way, we (should) have one buffer per column per batch.
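
To make the buffer lifetime concrete, here is a minimal, self-contained sketch of the per-column copy described above (simplified from the snippets in this thread; not the exact PR code):

import pyarrow as pa

def copy_columns(shard: pa.Table) -> pa.Table:
    # Each column of a table produced by RecordBatchStreamReader.read_all() is
    # a ChunkedArray with one chunk per record batch; those per-batch buffers
    # stay alive until the whole table has been converted to pandas.
    # Concatenating the chunks copies each column into a single contiguous
    # array, so the per-batch buffers can be released right away.
    arrs = [pa.concat_arrays(column.chunks) for column in shard.columns]
    return pa.table(data=arrs, schema=shard.schema)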

@hendrikmakait (Member Author):

Similarly, pa.Table.combine_chunks proceeds on a per-column basis, causing a spike in temporary memory usage (see #8128).

@hendrikmakait (Member Author):

cc @phofl in case you have some thoughts on this


github-actions bot commented Sep 4, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

21 files ±0    21 suites ±0    10h 58m 4s ⏱️ (+22m 54s)
3 814 tests ±0:   3 702 passed ✔️ (−2),   107 skipped 💤 (±0),   5 failed (+2)
36 855 runs (−12):   35 041 passed ✔️ (−24),   1 807 skipped 💤 (+8),   7 failed (+4)

For more details on these failures, see this check.

Results for commit f23c1aa. ± Comparison against base commit e350c99.

♻️ This comment has been updated with latest results.

@hendrikmakait hendrikmakait changed the title from "Reduce memory footprint of P2P shuffling /2" to "Reduce memory footprint of P2P shuffling" Sep 6, 2023
@hendrikmakait hendrikmakait marked this pull request as ready for review September 6, 2023 12:03
@fjetter fjetter (Member) left a comment

Did a very rough review. So far LGTM but I'll want to test drive this before merging. Will come back asap



-def convert_partition(data: bytes, meta: pd.DataFrame) -> pd.DataFrame:
+def convert_shards(shards: list[pa.Table], meta: pd.DataFrame) -> pd.DataFrame:
@fjetter (Member):

(disclaimer: still in early review) I once tried to move tables around instead of bytes but that messed up the event loop. We should check this before merging


fjetter commented Sep 7, 2023

Raising the minimum pyarrow version is something we have to do more carefully. Otherwise, this will silently cause the default shuffle method to fall back to tasks unless users upgrade their pyarrow version. At the very least, we should raise a warning if pyarrow is installed but the version is too low.

Safer would likely be to raise hard in this case. For instance, if the version check fails, we could raise a ValueError instead, which is not handled by dask.utils.get_default_shuffle_method. Edit: This won't work; get_default_shuffle_method is weird in how it catches all kinds of exceptions.

I doubt anybody would want to have pyarrow installed, shuffle a dataframe, and yet not use p2p just because the version is too old.


fjetter commented Sep 7, 2023

A/B test would obviously be nice

@hendrikmakait (Member Author):

A/B test would obviously be nice

Running those today

Comment on lines +32 to +42
     Raises a ModuleNotFoundError if pyarrow is not installed or an
     ImportError if the installed version is not recent enough.
     """
-    # First version to introduce Table.sort_by
-    minversion = "7.0.0"
+    # First version that supports concatenating extension arrays (apache/arrow#14463)
+    minversion = "12.0.0"
     try:
         import pyarrow as pa
-    except ImportError:
-        raise RuntimeError(f"P2P shuffling requires pyarrow>={minversion}")
+    except ModuleNotFoundError:
+        raise ModuleNotFoundError(f"P2P shuffling requires pyarrow>={minversion}")
     if parse(pa.__version__) < parse(minversion):
-        raise RuntimeError(
+        raise ImportError(
@hendrikmakait (Member Author):

@fjetter: Together with dask/dask#10496, get_default_shuffle_method should raise if pyarrow is outdated and choose tasks if it's not installed.
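
For illustration, a rough sketch of the fallback semantics described here; the helper names are stand-ins (not dask's actual code), assuming a version check that raises ModuleNotFoundError when pyarrow is missing and ImportError when it is too old:

from packaging.version import parse

MINVERSION = "12.0.0"

def check_minimal_arrow_version() -> None:
    # Stand-in for the check in this PR: a missing pyarrow raises
    # ModuleNotFoundError, an outdated pyarrow raises ImportError.
    import pyarrow as pa

    if parse(pa.__version__) < parse(MINVERSION):
        raise ImportError(f"P2P shuffling requires pyarrow>={MINVERSION}")

def pick_default_shuffle_method() -> str:
    # Hypothetical stand-in for dask's get_default_shuffle_method, only to
    # illustrate the intended behaviour: fall back to task-based shuffling
    # when pyarrow is absent, but fail loudly when it is merely outdated.
    try:
        check_minimal_arrow_version()
    except ModuleNotFoundError:
        return "tasks"
    # An ImportError from an outdated pyarrow is deliberately not caught, so
    # the user sees the version requirement instead of a silent method change.
    return "p2p"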

@hendrikmakait (Member Author):

(testing it manually)

batch_size = parse_bytes("1 MiB")
batch = []
shards = []
schema = pa.Schema.from_pandas(meta, preserve_index=True)
@hendrikmakait (Member Author):

Not using pyarrow_schema_dispatch here because it doesn't support preserve_index yet.


def read_from_disk(path: Path, meta: pd.DataFrame) -> tuple[Any, int]:
    import pyarrow as pa

    batch_size = parse_bytes("1 MiB")
@hendrikmakait (Member Author):

This is fragile and I don't really like it, but for now it seems to do the job. We will have to spend more time on performance optimization and understanding memory (de)allocation here to make this more robust.
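
For reference, a hedged sketch of the size-based batching referred to here, reconstructed from the snippets in this thread (variable names and the exact control flow are assumptions, not the merged code):

from pathlib import Path

import pyarrow as pa
from dask.utils import parse_bytes

def read_shards(path: Path) -> list[pa.Table]:
    batch_size = parse_bytes("1 MiB")  # heuristic cap; the "fragile" part
    batch: list[pa.Table] = []
    batch_bytes = 0
    shards: list[pa.Table] = []
    with pa.OSFile(str(path), mode="rb") as f:
        size = f.seek(0, whence=2)
        f.seek(0)
        while f.tell() < size:
            shard = pa.RecordBatchStreamReader(f).read_all()
            batch.append(shard)
            batch_bytes += shard.nbytes
            if batch_bytes >= batch_size:
                # Concatenate the accumulated small tables into one shard so
                # downstream conversion handles fewer, larger objects.
                shards.append(pa.concat_tables(batch))
                batch, batch_bytes = [], 0
    if batch:
        shards.append(pa.concat_tables(batch))
    return shards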

@hendrikmakait (Member Author):

A/B test results (https://github.com/coiled/benchmarks/actions/runs/6120630936):

Runtime performance takes a minor hit on some tests

[Screenshot: runtime comparison]

but average

[Screenshot: average memory comparison]

and peak memory improve significantly.

[Screenshot: peak memory comparison]

I'm confident that we'll get runtime down again through further performance optimization and batching on the write side.


phofl commented Sep 8, 2023

I think the hit is ok when looking at the memory improvement

@hendrikmakait (Member Author):

test_h2o.py::test_q6[5 GB (parquet+pyarrow)-p2p] is curious because it is the only one suffering a significant increase in peak memory (from 7.75 GB to 9.7 GB cluster-wide memory).


fjetter commented Sep 11, 2023

At the very least test_default_get is a related failure. I'll look into it


fjetter commented Sep 12, 2023

I'm currently running a CI test on my fork against the dask/dask sibling branch to verify this works as expected


fjetter commented Sep 13, 2023

There may actually be a related failure. test_restarting_during_unpack_raises_killed_worker is timing out (pytest-timeout) due to a CancelledError being raised

2023-09-12 14:18:03,520 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute(('shuffle_p2p-f443ad86627e550c09ed5235d5a76ddc', 31))" coro=<Worker.execute() done, defined at D:\a\distributed\distributed\distributed\worker_state_machine.py:3608>> ended with CancelledError

https://github.com/fjetter/distributed/actions/runs/6160365960/job/16717162228

I have to check if #8110 is included in this (I guess it is)


fjetter commented Sep 13, 2023

OK, I could track this CancelledError down somewhat... The important message is that this is not an actual computation deadlock. The above "async instruction was cancelled" msg is expected if a worker closes while a task is being executed. The state machine task is cancelled, but the thread still keeps running unnoticed.

The test actually reaches a check_worker_cleanup which raises an AssertionError. The test then gets stuck during teardown because the shuffle plugin teardown is locking up. Bad, but not as bad as a locked-up computation... still investigating why that is.

I think the CancelledError is actually a red herring since it is triggered by the test hitting a timeout while the shuffle plugin is closing.


fjetter commented Sep 13, 2023

I strongly suspect that test failure is unrelated but I will spend some more time trying to hunt this down... 🤞


fjetter commented Sep 13, 2023

Found the cause of why this was blocking; see #8184.

This is an unrelated fix and we should be able to proceed here

@fjetter fjetter merged commit e57d1c5 into dask:main Sep 14, 2023


Development

Successfully merging this pull request may close these issues.

P2P blows up memory
