This repository was archived by the owner on Jun 21, 2024. It is now read-only.

RuntimeError: No available kernel. Aborting execution. #9

Description

@zarandioon

When I run the inference logic using the following script, I get a `RuntimeError: No available kernel. Aborting execution.` error:

A100 GPU detected, using flash attention if input tensor is on cuda
  0%|          | 0/251 [00:00<?, ?it/s]
/home/azureuser/PaLM/.venv/lib/python3.8/site-packages/palm_rlhf_pytorch/attention.py:100: UserWarning: Memory efficient kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:659.)
  out = F.scaled_dot_product_attention(
/home/azureuser/PaLM/.venv/lib/python3.8/site-packages/palm_rlhf_pytorch/attention.py:100: UserWarning: Memory Efficient attention has been runtime disabled. (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:450.)
  out = F.scaled_dot_product_attention(
/home/azureuser/PaLM/.venv/lib/python3.8/site-packages/palm_rlhf_pytorch/attention.py:100: UserWarning: Flash attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:661.)
  out = F.scaled_dot_product_attention(
/home/azureuser/PaLM/.venv/lib/python3.8/site-packages/palm_rlhf_pytorch/attention.py:100: UserWarning: Expected query, key and value to all be of dtype: {Half, BFloat16}. Got Query dtype: float, Key dtype: float, and Value dtype: float instead. (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:100.)
  out = F.scaled_dot_product_attention(
  0%|          | 0/251 [00:00<?, ?it/s]
Traceback (most recent call last):

... <truncated>

  File "/home/azureuser/PaLM/.venv/lib/python3.8/site-packages/palm_rlhf_pytorch/attention.py", line 100, in flash_attn
    out = F.scaled_dot_product_attention(
RuntimeError: No available kernel.  Aborting execution.
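For what it's worth, the dtype warning above ("Expected query, key and value to all be of dtype: {Half, BFloat16}") suggests the fused flash/memory-efficient kernels only accept fp16/bf16 inputs, so with the math fallback runtime-disabled there is no kernel left for float32 tensors. A minimal sketch of what I think is happening (the shapes and the `sdp_kernel` usage here are illustrative, not taken from the PaLM code):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(1, 8, 128, 64)
k, v = torch.randn_like(q), torch.randn_like(q)

# The math kernel handles float32 on any device:
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 128, 64])

# On CUDA, the fused kernels want half precision. If the math fallback
# is disabled (as the "runtime disabled" warning above implies), float32
# inputs leave no eligible kernel and SDPA raises
# "RuntimeError: No available kernel. Aborting execution."
if torch.cuda.is_available():
    qh, kh, vh = (t.cuda().half() for t in (q, k, v))
    with torch.backends.cuda.sdp_kernel(enable_math=False):
        # half-precision inputs let the flash kernel dispatch
        out = F.scaled_dot_product_attention(qh, kh, vh)
```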

I tried installing the PyTorch nightly build, but that did not help:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121

CUDA toolkit (nvcc) version:

/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
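(Side note, and an assumption on my part: the nightly wheel bundles its own CUDA runtime, so the system nvcc 11.1 above is not necessarily what PyTorch runs against. A quick way to check what the installed wheel was actually built with:)

```python
import torch

print(torch.__version__)        # e.g. 2.1.0.dev20230618+cu121
print(torch.version.cuda)       # CUDA version the wheel was built against
print(torch.cuda.is_available())
if torch.cuda.is_available():
    # an A100 reports compute capability (8, 0)
    print(torch.cuda.get_device_capability(0))
```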

PyTorch version:

pip3 show torch
Name: torch
Version: 2.1.0.dev20230618+cu121
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /home/azureuser/PaLM/.venv/lib/python3.8/site-packages
Requires: filelock, pytorch-triton, sympy, networkx, jinja2, fsspec, typing-extensions
Required-by: torchvision, torchaudio, PaLM-rlhf-pytorch, lion-pytorch, accelerate

Any idea what could cause this?
