This repository was archived by the owner on Nov 17, 2023. It is now read-only.

[v1.x] CD cu102 110 test stage [Check failed: device_count_ > 0 (-1 vs. 0) : GPU usage requires at least 1 GPU] #19948

@Zha0q1

Description


This issue started after we switched to the new AMI for restricted-mxnetlinux-gpu, which has the newer NVIDIA driver 460.32.03.

https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job-1.x/detail/mxnet-cd-release-job-1.x/1553/pipeline

This happened with cu102 and cu110, but not with cu100 or cu101. I was able to reproduce it by building essentially the same image as in the CD pipeline, using the same g3 instance and the same AMI:

docker build -f docker/Dockerfile.build.ubuntu_gpu_cu102 --build-arg USER_ID=1001 --build-arg GROUP_ID=1001 --cache-from 021742426385.dkr.ecr.us-west-2.amazonaws.com/mxnet-ci:build.ubuntu_gpu_cu102-81dcd5660530 -t 021742426385.dkr.ecr.us-west-2.amazonaws.com/mxnet-ci:build.ubuntu_gpu_cu102-81dcd5660530 docker

After entering the Docker container, I ran:

pip3 install mxnet-cu102

I was able to reproduce the exact error by running:

>>> import mxnet
>>> import mxnet as mx
>>> ctx = mx.gpu(0)
>>> a = mx.nd.ones((100), ctx=ctx)
[02:55:03] src/base.cc:49: GPU context requested, but no GPUs found.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/dist-packages/mxnet/ndarray/ndarray.py", line 3295, in ones
    return _internal._ones(shape=shape, ctx=ctx, dtype=dtype, **kwargs)
  File "<string>", line 39, in _ones
  File "/usr/local/lib/python3.7/dist-packages/mxnet/_ctypes/ndarray.py", line 91, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/usr/local/lib/python3.7/dist-packages/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "src/engine/threaded_engine.cc", line 331
MXNetError: Check failed: device_count_ > 0 (-1 vs. 0) : GPU usage requires at least 1 GPU
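
To narrow down where the failure comes from, it may help to check what the container itself can see before MXNet gets involved. The following is a minimal sketch, not something taken from the CD pipeline: the helper names (`visible_gpu_nodes`, `driver_version`, `mxnet_gpu_report`, `diagnose`) are hypothetical, and it assumes the usual Linux `/dev/nvidia<N>` device-node layout and that `nvidia-smi` is on `PATH` when the driver is exposed to the container. The `-1` in the failed check may indicate that the device count query itself did not succeed, rather than that it returned zero devices, so comparing these three views could show whether the container, the driver, or the CUDA runtime bundled in the wheel is the odd one out.

```python
import glob
import shutil
import subprocess


def visible_gpu_nodes(pattern="/dev/nvidia[0-9]*"):
    """Return the NVIDIA device nodes visible in this environment, sorted.

    Assumption: GPUs appear as /dev/nvidia0, /dev/nvidia1, ... on Linux.
    """
    return sorted(glob.glob(pattern))


def driver_version():
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30, check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


def mxnet_gpu_report():
    """Describe how many GPUs MXNet sees, or why it could not be asked."""
    try:
        import mxnet as mx
    except Exception as exc:  # ImportError, or a failure loading the native library
        return "mxnet not importable: %r" % (exc,)
    try:
        return "mxnet sees %d GPU(s)" % mx.context.num_gpus()
    except Exception as exc:  # e.g. MXNetError if the CUDA runtime query fails
        return "mxnet GPU query failed: %r" % (exc,)


def diagnose():
    """Collect the three views (device nodes, driver, MXNet) in one dict."""
    return {
        "device_nodes": visible_gpu_nodes(),
        "driver_version": driver_version(),
        "mxnet": mxnet_gpu_report(),
    }


if __name__ == "__main__":
    for key, value in diagnose().items():
        print("%s: %s" % (key, value))
```

Running this inside the container for both a failing image (cu102 or cu110) and a working one (cu100 or cu101) should make any difference visible. If the device nodes and driver version are identical in both and only the MXNet line differs, the problem is more likely in the CUDA runtime inside the image or wheel than in how the container is launched.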
