[BUG]: colossalai check -i failed using pip from https://www.colossalai.org/download/ #2402

@better629

🐛 Describe the bug

Installing colossalai with pip install colossalai==0.2.0+torch1.12cu11.3 -f https://release.colossalai.org succeeded, and importing the package also succeeded.

However, colossalai check -i failed, and colossalai run xx failed as well.

logs:

(colos) [root@xxx hybrid_parallel]# colossalai check -i
Using /root/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Traceback (most recent call last):
  File "/root/anaconda3/envs/colos/lib/python3.9/site-packages/colossalai/kernel/__init__.py", line 4, in <module>
    from colossalai._C import fused_optim
ImportError: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by /root/anaconda3/envs/colos/lib/python3.9/site-packages/colossalai/_C/fused_optim.cpython-39-x86_64-linux-gnu.so)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/anaconda3/envs/colos/bin/colossalai", line 5, in <module>
    from colossalai.cli import cli
  File "/root/anaconda3/envs/colos/lib/python3.9/site-packages/colossalai/__init__.py", line 1, in <module>
    from .initialize import (
  File "/root/anaconda3/envs/colos/lib/python3.9/site-packages/colossalai/initialize.py", line 23, in <module>
    from colossalai.engine.schedule import NonPipelineSchedule, PipelineSchedule, InterleavedPipelineSchedule, get_tensor_shape
    fused_optim = FusedOptimBuilder().load()
  File "/root/anaconda3/envs/colos/lib/python3.9/site-packages/colossalai/kernel/op_builder/builder.py", line 77, in load
    op_module = load(name=self.name,
  File "/root/anaconda3/envs/colos/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1202, in load
  File "/root/anaconda3/envs/colos/lib/python3.9/site-packages/torch/utils/_cpp_extension_versioner.py", line 15, in hash_source_files
    with open(filename) as file:
FileNotFoundError: [Errno 2] No such file or directory: '/root/anaconda3/envs/colos/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/colossal_C_frontend.cpp'

(colos) [root@xxx hybrid_parallel]# strings /lib64/libm.so.6 | grep GLIBC
GLIBC_2.2.5
GLIBC_2.4
GLIBC_2.15
GLIBC_PRIVATE
GLIBC_2.15
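
The same check can be done programmatically. The following is a minimal Python sketch, not part of Colossal-AI: it parses the listing above and reports whether the GLIBC_2.29 tag named in the ImportError is present. The helper names `glibc_versions` and `provides` are made up for illustration.

```python
import re

# Output of `strings /lib64/libm.so.6 | grep GLIBC`, copied from the log above.
STRINGS_OUTPUT = """\
GLIBC_2.2.5
GLIBC_2.4
GLIBC_2.15
GLIBC_PRIVATE
GLIBC_2.15
"""


def glibc_versions(text):
    """Return the sorted, de-duplicated GLIBC_x.y[.z] tags as integer tuples."""
    found = set()
    for match in re.finditer(r"GLIBC_(\d+(?:\.\d+)+)", text):
        found.add(tuple(int(part) for part in match.group(1).split(".")))
    return sorted(found)


def provides(text, required):
    """True if the exact version tag `required` appears in the listing."""
    return required in glibc_versions(text)


versions = glibc_versions(STRINGS_OUTPUT)
newest = versions[-1]
print("newest tag:", ".".join(map(str, newest)))              # newest tag: 2.15
print("has GLIBC_2.29:", provides(STRINGS_OUTPUT, (2, 29)))   # has GLIBC_2.29: False
```

Versions are compared as integer tuples rather than strings, so 2.15 sorts above 2.4. GLIBC_PRIVATE carries no number and is ignored.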

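The system libm only provides symbol versions up to GLIBC_2.15, so a newer glibc is needed: glibc 2.29, the version named in the ImportError. I installed it following https://blog.csdn.net/m0_37201243/article/details/123641552
After installing glibc 2.29, colossalai check -i succeeded: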

#### Installation Report ####

Colossal-AI version: 0.2.0
----------------------------
PyTorch Version: 1.12.0
PyTorch Version required by Colossal-AI: 1.12
PyTorch version match: ✓
----------------------------
System CUDA Version: 11.8
CUDA Version required by PyTorch: 11.3
CUDA Version required by Colossal-AI: 11.3
CUDA Version Match: x
----------------------------
CUDA Extension: ✓
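
To pull the failing items out of a report like this, a small parser is enough. The sketch below is only an illustration and assumes that a trailing "x" marks a failed check and "✓" a passed one, as the report above appears to do. `failed_checks` is a hypothetical helper, not a Colossal-AI function.

```python
# Installation report copied from the output above.
REPORT = """\
Colossal-AI version: 0.2.0
----------------------------
PyTorch Version: 1.12.0
PyTorch Version required by Colossal-AI: 1.12
PyTorch version match: ✓
----------------------------
System CUDA Version: 11.8
CUDA Version required by PyTorch: 11.3
CUDA Version required by Colossal-AI: 11.3
CUDA Version Match: x
----------------------------
CUDA Extension: ✓
"""


def failed_checks(report):
    """Return the names of report entries whose value is 'x' (assumed to mean failed)."""
    failed = []
    for line in report.splitlines():
        name, sep, value = line.rpartition(":")
        if sep and value.strip() == "x":
            failed.append(name.strip())
    return failed


print(failed_checks(REPORT))  # ['CUDA Version Match']
```

Here the only flagged item is the CUDA version match: the system reports CUDA 11.8 while PyTorch and Colossal-AI were built for 11.3.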

However, running the hybrid_parallel example still failed:

(colos) [colos@xxx hybrid_parallel]# colossalai run --nproc_per_node 2 train.py --config config.py
Error: failed to run torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py --config config.py on 127.0.0.1
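
The error message above does not include the underlying traceback. One way to see it is to run the same torchrun command directly and capture its output. The following is a rough Python sketch under the assumption that torchrun is on PATH and that it is run from the hybrid_parallel directory. `run_and_capture` is a hypothetical helper, and the demo call uses a stand-in command so the snippet runs anywhere.

```python
import subprocess
import sys

# Command copied from the error message above; assumes torchrun is on PATH.
TORCHRUN_CMD = [
    "torchrun", "--nproc_per_node=2", "--nnodes=1", "--node_rank=0",
    "--rdzv_backend=c10d", "--rdzv_endpoint=127.0.0.1:29500",
    "--rdzv_id=colossalai-default-job", "train.py", "--config", "config.py",
]


def run_and_capture(cmd):
    """Run cmd and return (exit code, combined stdout and stderr as text)."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so tracebacks are not lost
        text=True,
    )
    return proc.returncode, proc.stdout


# Stand-in command that fails the way a crashing training script would.
demo_cmd = [sys.executable, "-c", "import sys; print('boom', file=sys.stderr); sys.exit(3)"]
code, output = run_and_capture(demo_cmd)
print(code, output.strip())
```

Calling `run_and_capture(TORCHRUN_CMD)` in place of the stand-in should return the exit code and the full output of the failing job, including any Python traceback from train.py.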

Environment

NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8
The conda environment has cudatoolkit 11.3 installed.
