
[BUG]: TORCH_PATCH contains invalid literal '0a0' | nvidia ngc pytorch Docker #3675

@LukasIAO

Description


🐛 Describe the bug

Installing ColossalAI from source with CUDA_EXT=1 pip install . fails inside Docker when using an NVIDIA NGC image.

Step 6/7 : RUN cd ColossalAI &&     CUDA_EXT=1 pip install -v --no-cache-dir  . &&     pip install -r requirements/requirements.txt
 ---> Running in 52a064371203
Using pip 21.2.4 from /opt/conda/lib/python3.8/site-packages/pip (python 3.8)
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Processing /workspace/repositories/ColossalAI
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
    Running command python setup.py egg_info
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-cky862ul/setup.py", line 126, in <module>
        environment_check_for_cuda_extension_build()
      File "/tmp/pip-req-build-cky862ul/setup.py", line 53, in environment_check_for_cuda_extension_build
        check_pytorch_version(MIN_PYTORCH_VERSION_MAJOR, MIN_PYTORCH_VERSION_MINOR)
      File "/tmp/pip-req-build-cky862ul/op_builder/utils.py", line 130, in check_pytorch_version
        torch_major, torch_minor, _ = get_pytorch_version()
      File "/tmp/pip-req-build-cky862ul/op_builder/utils.py", line 114, in get_pytorch_version
        TORCH_PATCH = int(torch_version.split('.')[2])
    ValueError: invalid literal for int() with base 10: '0a0'
    False
WARNING: Discarding file:///workspace/repositories/ColossalAI. Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
The command '/bin/sh -c cd ColossalAI &&     CUDA_EXT=1 pip install -v --no-cache-dir  . &&     pip install -r requirements/requirements.txt' returned a non-zero code: 1
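The failure can be reproduced in isolation. Per the traceback, op_builder/utils.py splits torch.__version__ on dots and calls int() on the patch component, which is '0a0' in this NGC build; a minimal sketch of that parsing step:

```python
torch_version = "1.11.0a0"  # torch.__version__ reported by the NGC 21.11 image

parts = torch_version.split(".")  # ['1', '11', '0a0']
torch_major, torch_minor = int(parts[0]), int(parts[1])

try:
    torch_patch = int(parts[2])  # int('0a0') cannot be parsed as base-10
except ValueError as exc:
    print(exc)  # invalid literal for int() with base 10: '0a0'
```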

Dockerfile:

FROM nvcr.io/nvidia/pytorch:21.11-py3

RUN mkdir -p repositories
WORKDIR repositories

RUN apt-get update

# cloning the ColossalAI repo
# main is currently bugged, hence the specific tag
RUN git clone https://github.com/hpcaitech/ColossalAI.git --branch v0.2.7

RUN cd ColossalAI && \
    CUDA_EXT=1 pip install -v --no-cache-dir  . && \
    pip install -r requirements/requirements.txt

The installation works if CUDA_EXT=1 is omitted. colossal check -i shows:

#### Installation Report ####

------------ Environment ------------
Colossal-AI version: 0.2.7
PyTorch version: 1.11.0a0
System CUDA version: 11.5
CUDA version required by PyTorch: 11.5

Note:
1. The table above checks the versions of the libraries/tools in the current environment
2. If the System CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
3. If the CUDA version required by PyTorch is N/A, you probably did not install a CUDA-compatible PyTorch. This value is given by torch.version.cuda and you can go to https://pytorch.org/get-started/locally/ to download the correct version.

------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: x
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A

Note:
1. AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
2. If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime

------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: ✓
System and Colossal-AI CUDA version match: N/A

Note:
1. The table above checks the version compatibility of the libraries/tools in the current environment
   - PyTorch version mismatch: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
   - System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
   - System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation

Note the PyTorch version: 1.11.0a0, which appears to cause the error. The problem is consistent across newer NGC PyTorch images such as 22.09, 23.03, and 23.04, as they all share the 0a0 suffix, which, to my understanding, denotes an alpha build.

This may be an issue with the NVIDIA image, but would it be possible to make the ColossalAI installation compatible with such PyTorch versions? If anyone has run into this and found a workaround, I would be grateful for some tips!
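One possible workaround is to make the version parser tolerate pre-release suffixes before converting the patch component to an integer. The helper below is a sketch of that idea (the function name parse_torch_version is hypothetical, not ColossalAI's actual API):

```python
import re


def parse_torch_version(version: str):
    """Parse 'MAJOR.MINOR.PATCH[suffix]' into integers, tolerating
    pre-release tags such as '1.11.0a0' found in NGC PyTorch builds."""
    major_s, minor_s, patch_s = version.split(".")[:3]
    # Keep only the leading digits of the patch component: '0a0' -> '0'
    patch_digits = re.match(r"\d+", patch_s)
    if patch_digits is None:
        raise ValueError(f"cannot parse patch component {patch_s!r}")
    return int(major_s), int(minor_s), int(patch_digits.group())


print(parse_torch_version("1.11.0a0"))  # (1, 11, 0)
print(parse_torch_version("2.0.1"))     # (2, 0, 1)
```

Alternatively, if the packaging library is available, packaging.version.parse("1.11.0a0").release already yields the tuple (1, 11, 0) without any manual string handling.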

Environment

Docker ENV:
Nvidia-ngc nvcr.io/nvidia/pytorch:21.11-py3 Docker image
Colossal-AI version: 0.2.7
PyTorch version: 1.11.0a0
System CUDA version: 11.5
CUDA version required by PyTorch: 11.5

Labels: bug (Something isn't working)