@ConnorBaker
Contributor

@ConnorBaker ConnorBaker commented Feb 22, 2023

Description of changes
Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 23.05 Release Notes (or backporting 22.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

@ConnorBaker ConnorBaker force-pushed the feat/nccl-use-cudaPackages branch from cd68b53 to 27a9a13 Compare February 22, 2023 04:06
@ofborg ofborg bot requested a review from mdaiter February 22, 2023 05:15
@ofborg ofborg bot added 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux. labels Feb 22, 2023
@ConnorBaker ConnorBaker force-pushed the feat/nccl-use-cudaPackages branch from 27a9a13 to 10bc11f Compare February 25, 2023 03:18
@SomeoneSerge SomeoneSerge added the 6.topic: cuda Parallel computing platform and API label Feb 27, 2023
@ConnorBaker ConnorBaker self-assigned this Mar 10, 2023
@ConnorBaker ConnorBaker force-pushed the feat/nccl-use-cudaPackages branch from 10bc11f to 025b8f9 Compare March 10, 2023 03:59
@ConnorBaker
Contributor Author

Waiting for #220402 to be merged prior to proceeding.

@ConnorBaker ConnorBaker force-pushed the feat/nccl-use-cudaPackages branch from 025b8f9 to f2b39d6 Compare March 13, 2023 22:40
@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/tweag-nix-dev-update-45/26397/1

@ConnorBaker ConnorBaker force-pushed the feat/nccl-use-cudaPackages branch from f2b39d6 to 5dc256c Compare March 18, 2023 20:07
@ConnorBaker ConnorBaker changed the title nccl: migrate to cudaPackages nccl: refactor to fix #220340 and #221895 Mar 18, 2023
@ConnorBaker ConnorBaker changed the title nccl: refactor to fix #220340 and #221895 cudaPackages.nccl: refactor to fix #220340 and #221895 Mar 18, 2023
@ConnorBaker
Contributor Author

Result of nixpkgs-review pr 217619 --extra-nixpkgs-config '{ allowUnfree = true; cudaSupport = true; cudaForwardCompat = false; cudaCapabilities = [ "8.6" ]; }' run on x86_64-linux

16 packages built:
  • cudaPackages.nccl
  • cudaPackages.nccl.dev
  • python310Packages.cupy
  • python310Packages.cupy.dist
  • python310Packages.jaxlibWithCuda
  • python310Packages.jaxlibWithCuda.dist
  • python310Packages.tensorflowWithCuda
  • python310Packages.tensorflowWithCuda.dist
  • python310Packages.torchWithCuda
  • python310Packages.torchWithCuda.dev
  • python310Packages.torchWithCuda.dist
  • python310Packages.torchWithCuda.lib
  • python311Packages.cupy
  • python311Packages.cupy.dist
  • python311Packages.jaxlibWithCuda
  • python311Packages.jaxlibWithCuda.dist
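
For anyone reproducing this outside of nixpkgs-review, the same configuration can presumably be passed when importing Nixpkgs directly. This is a minimal sketch, assuming `<nixpkgs>` resolves to a checkout that includes this PR; the option values are copied from the command above, and the file name used below is hypothetical.

```nix
# Sketch only: evaluates Nixpkgs with the same config the review run used.
# Assumes <nixpkgs> points at a checkout containing this PR.
let
  pkgs = import <nixpkgs> {
    system = "x86_64-linux";
    config = {
      allowUnfree = true;
      cudaSupport = true;
      cudaForwardCompat = false;
      cudaCapabilities = [ "8.6" ];
    };
  };
in
pkgs.cudaPackages.nccl
```

Saved as, say, `nccl-review.nix`, this could be built with `nix-build nccl-review.nix`.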

@ConnorBaker ConnorBaker marked this pull request as ready for review March 19, 2023 02:23
@ConnorBaker
Contributor Author

CC @NixOS/cuda-maintainers

Member

@samuela samuela left a comment

I'm not familiar with what the runtime dependencies of nccl are, but these changes look reasonable to me. Perhaps @mdaiter or @orivej could comment?

As we move towards more of a "python-packages.nix" model, I think it might be worth collecting all cudaPackages.* packages into a single directory like python does. We have a bunch of random stuff spread out all over the codebase for no good reason except that that's just how things evolved over time. But we don't need to keep living like this. NCCL isn't really math software anyhow.

@ConnorBaker
Contributor Author

> As we move towards more of a "python-packages.nix" model, I think it might be worth collecting all cudaPackages.* packages into a single directory like python does. We have a bunch of random stuff spread out all over the codebase for no good reason except that that's just how things evolved over time. But we don't need to keep living like this. NCCL isn't really math software anyhow.

I like that idea! It would certainly make it easier to keep track of different components.

How do you envision stuff like torch or magma working? They're packages which can rely on CUDA, but which (at least in torch's case) perhaps don't belong in the CUDA packages directory.

As a separate point, in the same way that we can get torch with different versions of python through python3*Packages, should we likewise have cudaPackages_*.torch available? If yes, how would we handle different versions of Python? If we were to want to match Anaconda, the user should be able to specify the version of CUDA and Python for torch. Of course this can be done by using overlays and changing what cudaPackages and python3Packages point to, but that's not necessarily user friendly if it's hard to discover/arrive at. (Although maybe this is tangential -- not enough coffee yet. If it is, let me know and I can throw it in a separate issue to track/discuss further!)

@samuela
Member

samuela commented Mar 20, 2023

> How do you envision stuff like torch or magma working? They're packages which can rely on CUDA, but which (at least in torch's case) perhaps don't belong in the CUDA packages directory.

Good question! The line is a little fuzzy, but I envision that the general policy would be that cudaPackages.* would be only for packages that are part of the CUDA toolkit or related packages from NVIDIA, and all consumers of those packages (torch, magma, etc.) would continue to live externally. This would more or less match the current Python package situation as far as I understand it -- e.g., C++ packages that rely on Python packages are still maintained outside of pkgs/development/python-modules.

> As a separate point, in the same way that we can get torch with different versions of python through python3*Packages, should we likewise have cudaPackages_*.torch available? If yes, how would we handle different versions of Python? If we were to want to match Anaconda, the user should be able to specify the version of CUDA and Python for torch. Of course this can be done by using overlays and changing what cudaPackages and python3Packages point to, but that's not necessarily user friendly if it's hard to discover/arrive at. (Although maybe this is tangential -- not enough coffee yet. If it is, let me know and I can throw it in a separate issue to track/discuss further!)

I believe this may indeed be a tangential point :P Actually, I think we should focus on supporting fewer version combinations, not more. Having to maintain support for all combinations increases testing and maintenance burden. Users who are interested in using older versions can always use custom overlays or pin nixpkgs to previous commits.
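
For illustration, the overlay approach mentioned in this exchange might look roughly like the sketch below. The versioned attribute names (`cudaPackages_11_8`, `python310`) are assumptions and depend on the Nixpkgs revision in use; nothing here is taken from the PR itself.

```nix
# Sketch of an overlay that repoints the default package sets.
# The versioned names on the right-hand side are assumptions; check
# which ones exist in the Nixpkgs revision you are using.
final: prev: {
  # Every consumer of `cudaPackages` now sees the 11.8 set.
  cudaPackages = final.cudaPackages_11_8;

  # Likewise for the default Python interpreter and its package set.
  python3 = final.python310;
  python3Packages = final.python310.pkgs;
}
```

It would be applied with something like `import <nixpkgs> { overlays = [ (import ./overlay.nix) ]; }`, where `./overlay.nix` is a hypothetical path to the file above. Pinning Nixpkgs to an older commit avoids the overlay entirely, at the cost of older versions of everything else.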

@orivej
Contributor

orivej commented Mar 24, 2023

> I'm not familiar with what the runtime dependencies of nccl are, but these changes look reasonable to me. Perhaps @mdaiter or @orivej could comment?

nccl does not have any runtime dependencies.

The changes look good to me (although I'm not familiar with the concept of backendStdenv and don't understand why build dependencies have to be moved to nativeBuildInputs).

@SomeoneSerge
Contributor

Thanks @orivej, backendStdenv is just a hacky way to refer to an nvcc-compatible gccStdenv. We had some libstdc++ mismatch issues.
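
As a rough illustration of that idea, a package inside the CUDA package set can ask for `backendStdenv` instead of the default `stdenv`, so that host code is compiled by a toolchain nvcc accepts. The sketch below is a hypothetical package file, not the nccl expression from this PR; `pname`, `version`, and `src` are placeholders.

```nix
# Hypothetical package file, meant to be called from within the CUDA
# package set, e.g. `cudaPackages.callPackage ./example.nix { }`.
{ backendStdenv, cuda_nvcc }:

# Using backendStdenv rather than stdenv keeps the host compiler (and
# its libstdc++) in line with what nvcc supports.
backendStdenv.mkDerivation {
  pname = "example-cuda-app"; # placeholder name
  version = "0.0.1";          # placeholder version
  src = ./.;                  # placeholder source

  nativeBuildInputs = [ cuda_nvcc ];
}
```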

@ConnorBaker ConnorBaker force-pushed the feat/nccl-use-cudaPackages branch from 5dc256c to d86b6a7 Compare March 26, 2023 00:40
@ConnorBaker
Contributor Author

Rebased, moved cuda_cccl and cuda_cudart to buildInputs from nativeBuildInputs, and changed license from bsd3 to unfreeRedistributable. Running nixpkgs-review now!
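
A sketch of what the resulting split could look like, assuming the usual Nixpkgs convention that `nativeBuildInputs` holds tools executed during the build and `buildInputs` holds libraries and headers for the platform the result runs on. This is not the actual diff: the placement of `which` and `cuda_nvcc` is an assumption, and `pname`, `version`, and `src` are placeholders.

```nix
# Hypothetical sketch of the input layout described in this comment.
{ lib, backendStdenv, which, cuda_nvcc, cuda_cudart, cuda_cccl }:

backendStdenv.mkDerivation {
  pname = "nccl-sketch"; # placeholder, not the real attribute
  version = "0";         # placeholder
  src = ./.;             # placeholder

  # Tools that run on the build machine (assumed placement).
  nativeBuildInputs = [ which cuda_nvcc ];

  # Libraries and headers compiled or linked against.
  buildInputs = [ cuda_cudart cuda_cccl ];

  # License as stated in the comment above.
  meta.license = lib.licenses.unfreeRedistributable;
}
```

The distinction matters mostly for cross-compilation and strict dependency separation, where the two lists are resolved for different platforms.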

@ConnorBaker ConnorBaker force-pushed the feat/nccl-use-cudaPackages branch from d86b6a7 to 15d0b2f Compare March 26, 2023 14:28
@ConnorBaker
Contributor Author

I let it run overnight on my i9 13900K and it still didn't finish -- hopefully this time around it'll be faster :x

@ofborg ofborg bot requested review from orivej and removed request for orivej March 26, 2023 15:56
@ConnorBaker ConnorBaker force-pushed the feat/nccl-use-cudaPackages branch from 15d0b2f to cecff7f Compare March 27, 2023 01:44
@ofborg ofborg bot requested review from orivej and removed request for orivej March 27, 2023 03:23
@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/tweag-nix-dev-update-46/26872/1

Contributor

What does it use `which` for again? Can we control it without `which`?

@SomeoneSerge
Contributor

I don't remember where we stalled, but this LGTM. Nixpkgs-review and merge?

@samuela
Member

samuela commented Mar 31, 2023

> I don't remember where we stalled, but this LGTM. Nixpkgs-review and merge?

I believe @ConnorBaker left a TODO item for himself here, but I'm not sure what the status of that is.

@ConnorBaker ConnorBaker force-pushed the feat/nccl-use-cudaPackages branch from cecff7f to f351800 Compare April 7, 2023 02:03
@ofborg ofborg bot requested a review from orivej April 7, 2023 02:18
@ConnorBaker
Contributor Author

Found out this was still open when I tried to update NCCL to 2.18.3.

Let me try to pick this back up :)

@ConnorBaker ConnorBaker force-pushed the feat/nccl-use-cudaPackages branch 4 times, most recently from 87bc76d to 3185b5e Compare August 22, 2023 18:30
@ConnorBaker
Contributor Author

Rebased, updated, and added more notes about what it's fixing.

Testing with a super simple test suite built around nccl-tests: https://github.com/ConnorBaker/nix-cuda-test

nix build -L --override-input nixpkgs github:nixos/nixpkgs/pull/217619/head github:ConnorBaker/nix-cuda-test/3074628f4a0ece4928e032f9e2f2f1307b6ed22d#nccl-test-suite

Seemed to work for me, though it's just one device.

https://gist.github.com/ConnorBaker/ea3c49e23c1eaf2544fe97ae6fdd67a5

@ConnorBaker ConnorBaker force-pushed the feat/nccl-use-cudaPackages branch from 3185b5e to 00296a3 Compare August 22, 2023 18:56
The previous usage of `cudaPackages` ensured that we only ever saw packages from
the default version of `cudaPackages` Nixpkgs uses.
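
The mechanism behind that commit message can be sketched with a toy model that does not use Nixpkgs at all. Everything here (`callPackageWith`, `cuda11`, `cuda12`, `ncclOld`, `ncclNew`) is a simplified, hypothetical stand-in: a package that asks for the whole `cudaPackages` set by name receives whatever the top level calls `cudaPackages`, whereas a package that asks for individual members receives them from the scope it is called in.

```nix
let
  # Simplified stand-in for callPackage: pass the function only the
  # arguments it asks for, looked up by name in `scope`.
  callPackageWith = scope: f:
    f (builtins.intersectAttrs (builtins.functionArgs f) scope);

  # Two toy CUDA package sets.
  cuda11 = { cudaVersion = "11.8"; };
  cuda12 = { cudaVersion = "12.2"; };

  # The top level exposes the default set under the name `cudaPackages`.
  topLevel = { cudaPackages = cuda11; };

  # Scope seen by packages inside the non-default set: top-level names
  # plus the set's own members.
  cuda12Scope = topLevel // cuda12;

  # Old style: takes the whole set by its top-level name.
  ncclOld = { cudaPackages }: "nccl for CUDA ${cudaPackages.cudaVersion}";

  # New style: takes individual members of the surrounding scope.
  ncclNew = { cudaVersion }: "nccl for CUDA ${cudaVersion}";
in
{
  old = callPackageWith cuda12Scope ncclOld; # "nccl for CUDA 11.8"
  new = callPackageWith cuda12Scope ncclNew; # "nccl for CUDA 12.2"
}
```

Evaluating the file with `nix-instantiate --eval --strict` shows `old` picking up the default set even though it was called from the 12.2 scope.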
@ConnorBaker ConnorBaker force-pushed the feat/nccl-use-cudaPackages branch from 00296a3 to 8208602 Compare August 22, 2023 19:00
@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. and removed 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. labels Aug 22, 2023
@ConnorBaker ConnorBaker merged commit eeefcf7 into NixOS:master Aug 22, 2023
@ConnorBaker ConnorBaker deleted the feat/nccl-use-cudaPackages branch August 22, 2023 23:40

Labels

  • 6.topic: cuda (Parallel computing platform and API)
  • 10.rebuild-darwin: 0 (This PR does not cause any packages to rebuild on Darwin.)
  • 10.rebuild-linux: 1-10 (This PR causes between 1 and 10 packages to rebuild on Linux.)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

  • cudaPackages.nccl: only ever builds with default version of cudaPackages
  • cudaPackages.nccl: switch to autoAddOpenGLRunpathHook

5 participants