This repository was archived by the owner on Jan 22, 2024. It is now read-only.
1. Issue or feature description
After a random amount of time (hours or days), the GPUs become unavailable inside all running containers and nvidia-smi returns "Failed to initialize NVML: Unknown Error".
Restarting all the containers fixes the issue and the GPUs become available again.
Outside the containers, the GPUs still work correctly.
I searched the open/closed issues but could not find a solution.
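Since the only known workaround described here is restarting the affected containers, the check-and-restart cycle can be sketched as a small watchdog. This is a hypothetical sketch, not part of the original report: the `nvml_broken` helper and the polling loop are illustrative, and the loop is shown commented out because it restarts containers as a side effect.

```shell
# Hypothetical watchdog sketch for the workaround described above:
# detect the NVML failure in a container's nvidia-smi output and,
# if found, restart that container.
nvml_broken() {
  # $1: captured output of `docker exec <container> nvidia-smi`
  case "$1" in
    *"Failed to initialize NVML"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example polling loop (commented out so the sketch has no side effects):
# for c in $(docker ps -q); do
#   out=$(docker exec "$c" nvidia-smi 2>&1)
#   if nvml_broken "$out"; then
#     docker restart "$c"
#   fi
# done
```

This only automates the symptom-level workaround from the report; it does not address the underlying cause.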
2. Steps to reproduce the issue
All the containers are run with docker run --gpus all -it tensorflow/tensorflow:latest-gpu /bin/bash
Kernel version from uname -a
Linux wds-co-ml 5.15.0-43-generic #46-Ubuntu SMP Tue Jul 12 10:30:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Driver information from nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Wed Aug 31 12:42:55 2022
Driver Version : 515.48.07
CUDA Version : 11.7
Attached GPUs : 2
GPU 00000000:01:00.0
Product Name : NVIDIA GeForce RTX 2080 Ti
Product Brand : GeForce
Product Architecture : Turing
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-13fd0930-06c3-5975-8720-72c72ee7a823
Minor Number : 0
VBIOS Version : 90.02.0B.00.C7
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Module ID : 0
Inforom Version
Image Version : G001.0000.02.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x1E0710DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x150319DA
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 11264 MiB
Reserved : 244 MiB
Used : 1 MiB
Free : 11018 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 3 MiB
Free : 253 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 30 C
GPU Shutdown Temp : 94 C
GPU Slowdown Temp : 91 C
GPU Max Operating Temp : 89 C
GPU Target Temperature : 84 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 20.87 W
Power Limit : 260.00 W
Default Power Limit : 260.00 W
Enforced Power Limit : 260.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2160 MHz
SM : 2160 MHz
Memory : 7000 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Processes : None
GPU 00000000:02:00.0
Product Name : NVIDIA GeForce RTX 2080 Ti
Product Brand : GeForce
Product Architecture : Turing
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-a76d37d7-5ed0-58d9-6087-b18fee984570
Minor Number : 1
VBIOS Version : 90.02.17.00.58
MultiGPU Board : No
Board ID : 0x200
GPU Part Number : N/A
Module ID : 0
Inforom Version
Image Version : G001.0000.02.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x02
Device : 0x00
Domain : 0x0000
Device Id : 0x1E0710DE
Bus Id : 00000000:02:00.0
Sub System Id : 0x150319DA
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 35 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 11264 MiB
Reserved : 244 MiB
Used : 1 MiB
Free : 11018 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 27 MiB
Free : 229 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 28 C
GPU Shutdown Temp : 94 C
GPU Slowdown Temp : 91 C
GPU Max Operating Temp : 89 C
GPU Target Temperature : 84 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 6.66 W
Power Limit : 260.00 W
Default Power Limit : 260.00 W
Enforced Power Limit : 260.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2160 MHz
SM : 2160 MHz
Memory : 7000 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Processes : None
Docker version from docker version
Client: Docker Engine - Community
Version: 20.10.17
API version: 1.41
Go version: go1.17.11
Git commit: 100c701
Built: Mon Jun 6 23:02:46 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.17
API version: 1.41 (minimum version 1.12)
Go version: go1.17.11
Git commit: a89b842
Built: Mon Jun 6 23:00:51 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.6
GitCommit: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
runc:
Version: 1.1.2
GitCommit: v1.1.2-0-ga916309
docker-init:
Version: 0.19.0
GitCommit: de40ad0
NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
ii libnvidia-cfg1-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-515 515.48.07-0ubuntu0.22.04.2 all Shared files used by the NVIDIA libraries
ii libnvidia-compute-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA libcompute package
ii libnvidia-compute-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA libcompute package
ii libnvidia-container-tools 1.10.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.10.0-1 amd64 NVIDIA container runtime library
ii libnvidia-decode-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-decode-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA Video Decoding runtime libraries
ii libnvidia-egl-wayland1:amd64 1:1.1.9-1.1 amd64 Wayland EGL External Platform library -- shared library
ii libnvidia-encode-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVENC Video Encoding runtime library
ii libnvidia-encode-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVENC Video Encoding runtime library
ii libnvidia-extra-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 Extra libraries for the NVIDIA driver
ii libnvidia-fbc1-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-fbc1-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-gl-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii linux-modules-nvidia-515-5.15.0-43-generic 5.15.0-43.46 amd64 Linux kernel nvidia modules for version 5.15.0-43
ii linux-modules-nvidia-515-generic-hwe-22.04 5.15.0-43.46 amd64 Extra drivers for nvidia-515 for the generic-hwe-22.04 flavour
ii linux-objects-nvidia-515-5.15.0-43-generic 5.15.0-43.46 amd64 Linux kernel nvidia modules for version 5.15.0-43 (objects)
ii linux-signatures-nvidia-5.15.0-43-generic 5.15.0-43.46 amd64 Linux kernel signatures for nvidia modules for version 5.15.0-43-generic
ii nvidia-compute-utils-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA compute utilities
ii nvidia-container-toolkit 1.10.0-1 amd64 NVIDIA container runtime hook
ii nvidia-docker2 2.11.0-1 all nvidia-docker CLI wrapper
ii nvidia-driver-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA driver metapackage
ii nvidia-kernel-common-515 515.48.07-0ubuntu0.22.04.2 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA kernel source package
ii nvidia-prime 0.8.17.1 all Tools to enable NVIDIA's Prime
ii nvidia-settings 510.47.03-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA driver support binaries
ii xserver-xorg-video-nvidia-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA binary Xorg driver
NVIDIA container library version from nvidia-container-cli -V
3. Information to attach
- nvidia-container-cli -k -d /dev/tty info
- uname -a
- nvidia-smi -a
- docker version
- dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
- nvidia-container-cli -V
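The diagnostics requested above can be gathered in one pass with a small script. This is a convenience sketch, not part of the issue template: the `nvidia-report.txt` filename and the `run` helper are made up here, and tools that are not installed on the host are simply skipped.

```shell
#!/bin/sh
# Sketch: collect the requested diagnostics into one report file.
# Any command not present on the host is recorded as "(not installed)".
OUT=nvidia-report.txt
: > "$OUT"

run() {
  printf '===== %s =====\n' "$*" >> "$OUT"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" >> "$OUT" 2>&1
  else
    echo "(not installed)" >> "$OUT"
  fi
}

run uname -a
run nvidia-smi -a
run docker version
run dpkg -l '*nvidia*'
run nvidia-container-cli -V
run nvidia-container-cli -k -d /dev/tty info
```

Attaching a single file produced this way keeps the outputs in a consistent order across reports.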