[BUG] Pin Test Version of PyTorch to 2.1 to Resolve NCCL Error #4464
raydouglass merged 19 commits into rapidsai:branch-24.06
Conversation
Does the conda recipe need to be updated as well? https://github.com/rapidsai/cugraph/blob/branch-24.06/conda/recipes/cugraph-pyg/meta.yaml#L37
Though, the dependency is not present in the wheel: https://github.com/rapidsai/cugraph/blob/branch-24.06/python/cugraph-pyg/pyproject.toml#L29-L34
Is this a runtime or test-only dependency?
@raydouglass I think I've addressed your concerns.
tingyu66 left a comment
LGTM. We should be able to remove this constraint once the lazy-load behavior in cupy is fixed.
jameslamb left a comment
👋🏻 hi, I'm here because @raydouglass asked me to come take a look.
Based on this statement:
> tensordict v0.4 requires PyTorch 2.3+ which is incompatible with the NCCL version on our containers
I don't think this fix will be enough to protect cugraph users. They could still install tensordict==0.3.0 together with pytorch>2.3.0, couldn't they?
If the direct issue is that cugraph is incompatible with pytorch >=2.3.0, I think the more reliable fix would be to put an upper bound like pytorch >=2.0,<2.3.0 on its pytorch dependency.
I'm leaving a non-blocking "comment" review in case I've misunderstood, and this incompatibility only shows up here in CI but wouldn't affect users.
The one other comment I left is minor and also doesn't need to block the PR.
This is really a CI issue; cuGraph supports PyTorch up to v2.3 except in environments with an older NCCL version (which is the result of a cupy bug). But even when that bug gets fixed, we still need to pin the test version of PyTorch. I guess we could make tensordict a test dependency instead.
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
| "numba>=0.57", | ||
| "numpy>=1.23,<2.0a0", | ||
| "pylibcugraphops==24.6.*", | ||
| "tensordict>=0.1.2,<0.3.1a0", |
I didn't catch this in my first pass, but this does not really enforce an upper bound for pytorch:
- tensordict 0.3.0: torch>=2.1.0
- tensordict 0.3.1: torch>=2.2.1
- tensordict 0.3.2: torch==2.2.2
In addition, this line now introduces pytorch as a hard dependency for cugraph-pyg, and it will pull pytorch from PyPI, as opposed to using its own index-url (as mentioned in https://pytorch.org/get-started/locally/). I am not clear about the consequences, such as whether it might pull a CPU-only version from PyPI.
If you look at torch_geometric's pyproject.toml, they don't even include pytorch as a hard dependency. One alternative is to run a version check at import time in cugraph-pyg.
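For illustration, a minimal sketch of what such an import-time check could look like; the function name, version bounds, and warning text here are hypothetical, not cugraph-pyg's actual packaging:

```python
# Hypothetical sketch of an import-time version check for cugraph-pyg;
# the bounds and messages are illustrative only.
import warnings


def _check_torch_version() -> None:
    try:
        import torch
    except ImportError as err:
        raise ImportError(
            "cugraph-pyg requires PyTorch; install it following "
            "https://pytorch.org/get-started/locally/"
        ) from err

    from packaging.version import Version

    # Strip any local build tag such as "+cu121" before comparing.
    found = Version(torch.__version__.split("+")[0])
    if not (Version("2.0") <= found < Version("2.3")):
        warnings.warn(
            f"cugraph-pyg is tested with PyTorch >=2.0,<2.3; found {found}"
        )


_check_torch_version()
```

This keeps pytorch out of the wheel's hard requirements while still failing loudly (or warning) when an incompatible version is present.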
Can't you just manually install the GPU version (as we do now)?
Yes, of course you can always manually install the correct GPU version to override whatever is installed by pip install cugraph-pyg.
I know it's not ideal, but otherwise we would be requiring users to manually install tensordict, which would be unexpected. Right now I assume most of them do a manual install of torch anyway.
I was just about to post something similar after reading @alexbarghi-nv's explanation in #4464 (comment) (thanks for that!). I saw these same version constraints looking through tensordict releases (e.g. https://github.com/pytorch/tensordict/blob/7dbb8649f13b17be973918ec3a2dae10c985b5c4/setup.py#L70).
If the core issue is "this project's CI setup is incompatible with testing against PyTorch >= 2.3.0", then yeah, similar to what was suggested in that comment, I think a better immediate fix would be:
- put a pin of the form pytorch<2.3.0 in the testing dependencies (a sketch follows below)
- don't modify anything about the project's tensordict dependency
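As a hedged illustration, a test-only pin along those lines might look roughly like this in cugraph-pyg's pyproject.toml; the section layout is assumed, and RAPIDS projects typically generate these lists from dependencies.yaml, so the real change would land there:

```toml
# Illustrative sketch only; the actual file is generated from dependencies.yaml.
[project.optional-dependencies]
test = [
    "pytest",
    "torch>=2.0,<2.3.0a0",  # CI ceiling until the NCCL/cupy issue is resolved
]
```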
That should lead the pip and conda solvers to avoid tensordict 0.4.0, because its requirement on torch>=2.3.0 wouldn't be satisfiable.
And it'd still make it possible for cugraph users to install cugraph==24.6.0 alongside a newer nccl and newer pytorch.
tensordict's 0.4.0 release happened to be the thing that introduced pytorch >=2.3.0 here, but some other dependency could do that in the future. Putting a ceiling in CI on pytorch is the most reliable way to prevent that from happening again... and hopefully for the next release you can work out a way to test against pytorch >=2.3.0 here in CI (maybe by upgrading beyond the cupy / nccl versions that are leading to the issue you've observed).
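To make that solver argument concrete, here is a small self-contained check using the packaging library; the requirement table is taken from the tensordict constraints quoted above, plus 0.4.0's torch>=2.3.0 floor:

```python
# Demonstrates why a CI ceiling of torch<2.3.0 excludes tensordict 0.4.0:
# no torch version can satisfy both the ceiling and 0.4.0's floor.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

ci_ceiling = SpecifierSet("<2.3.0")  # proposed CI pin on torch
tensordict_requires = {
    "0.3.0": SpecifierSet(">=2.1.0"),
    "0.3.1": SpecifierSet(">=2.2.1"),
    "0.3.2": SpecifierSet("==2.2.2"),
    "0.4.0": SpecifierSet(">=2.3.0"),
}
candidates = [Version(v) for v in ("2.1.0", "2.2.1", "2.2.2", "2.3.0")]

for release, requires in tensordict_requires.items():
    # A release is installable iff some torch version satisfies both
    # its own requirement and the CI ceiling.
    ok = any(v in requires and v in ci_ceiling for v in candidates)
    print(f"tensordict {release}: {'installable' if ok else 'excluded'} with torch{ci_ceiling}")
```

Running this prints "excluded" only for 0.4.0, which is exactly the behavior the CI-side pin relies on.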
We still want the tensordict pin change, though. Otherwise we are locking out users on PyTorch 2.2 and 2.1, which we do not want.
I don't think so?
The pin is currently tensordict>=0.1.2.
That's sufficiently old to be installed alongside pytorch==2.1.0.
I just tried this:
conda create --name pytorch-test --dry-run \
  -c conda-forge \
  -c pytorch \
  'tensordict>=0.1.2' \
  'pytorch==2.1.0'

And got a successful solution like this:
libtorch conda-forge/linux-64::libtorch-2.1.0-cuda118_h7595d50_303
...
pytorch conda-forge/linux-64::pytorch-2.1.0-cuda118_py312hffed1d1_303
...
tensordict conda-forge/noarch::tensordict-0.1.2-pyh6074d0b_0
output of 'conda info':
active environment : None
user config file : /home/nfs/jlamb/.condarc
populated config files : /raid/jlamb/miniforge/.condarc
/home/nfs/jlamb/.condarc
conda version : 24.3.0
conda-build version : not installed
python version : 3.10.12.final.0
solver : libmamba (default)
virtual packages : __archspec=1=broadwell
__conda=24.3.0=0
__cuda=12.2=0
__glibc=2.31=0
__linux=5.4.0=0
__unix=0=0
base environment : /raid/jlamb/miniforge (writable)
conda av data dir : /raid/jlamb/miniforge/etc/conda
conda av metadata url : None
channel URLs : https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
package cache : /raid/jlamb/miniforge/pkgs
/home/nfs/jlamb/.conda/pkgs
envs directories : /raid/jlamb/miniforge/envs
/home/nfs/jlamb/.conda/envs
platform : linux-64
user-agent : conda/24.3.0 requests/2.31.0 CPython/3.10.12 Linux/5.4.0-182-generic ubuntu/20.04.6 glibc/2.31 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.8
UID:GID : 10349:10004
netrc file : None
offline mode : False
Sorry, you're right. I'll revert it.
The latest changes look good to me! Thanks for hearing me out.
Also @tingyu66, I should point out that we were already making PyTorch a dependency for Conda, so this just matches it on pip.
If I understand correctly, this makes PyTorch a hard dependency of the cugraph-pyg wheel. If that's accurate, this is going to be difficult to build in DLFW. We don't build/install torch in the RAPIDS stages, because the DLFW PyTorch team is downstream from us. I don't know how we can both create a cugraph-pyg package in those stages and satisfy a hard torch dependency.
I forgot about this. In that case I guess we need to make the torch dependency optional.
So @trxcllnt, it looks like I already used …
See #4486 for adding torchdata/pydantic and #4464 for introducing the torch pin due to NCCL difficulties. It is entirely possible that this still fails due to the NCCL issue!
PyTorch 2.2+ is incompatible with the NCCL version on our containers. Normally, this would not be an issue, but there is a bug in CuPy that loads the system NCCL instead of the user's NCCL. This PR pins the PyTorch test dependency version to get around this issue.
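For reference, a quick hedged diagnostic (not part of this PR) to compare the NCCL versions that PyTorch and CuPy each report, which is the mismatch described above; it assumes both libraries are installed with NCCL support:

```python
# Diagnostic sketch: print the NCCL version each library reports.
# A mismatch here is consistent with the CuPy lazy-load bug described above.
import torch
from cupy.cuda import nccl

# PyTorch reports the NCCL it was built against, e.g. (2, 19, 3).
print("PyTorch NCCL:", torch.cuda.nccl.version())

# CuPy reports the NCCL it actually loaded at runtime, as an integer
# encoded like 21903 for 2.19.3.
print("CuPy NCCL:", nccl.get_version())
```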