Hi!
I’m encountering an issue where the backward pass of torch.nn.functional.scaled_dot_product_attention
fails on an H100 GPU but succeeds on an A100 GPU.
I’ve tested this with the following script:
import logging
import sys

import torch
import torch.nn.functional as F


def main():
    # setup
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream=sys.stdout)
    formatter = logging.Formatter(
        fmt="%(asctime)s %(levelname).1s %(message)s",
        datefmt="%m-%d %H:%M:%S",
    )
    handler.setFormatter(formatter)
    logger.handlers.append(handler)
    device = torch.device("cuda:0")

    # log versions
    logging.info(f"torch.version {torch.__version__}")
    logging.info(f"torch.version.cuda {torch.version.cuda}")
    logging.info(f"device name {torch.cuda.get_device_name()}")
    logging.info(f"device capability {torch.cuda.get_device_capability()}")
    logging.info(f"device properties {torch.cuda.get_device_properties(device)}")

    # init qkv
    dim = 768
    num_heads = 16
    qkv = torch.nn.Linear(dim, dim * 3).to(device)

    # simulate forward pass of a VisionTransformer
    x = torch.randn(4, 197, dim, device=device)
    B, N, C = x.shape
    logging.info("qkv")
    qkv = qkv(x).reshape(B, N, 3, num_heads, C // num_heads).permute(2, 0, 3, 1, 4)
    q, k, v = qkv.unbind(0)

    # forward/backward
    logging.info("scaled_dot_product_attention")
    with torch.autocast("cuda", dtype=torch.bfloat16):
        x = F.scaled_dot_product_attention(q, k, v)
    logging.info("backward")
    x.mean().backward()
    logging.info("fin")


if __name__ == "__main__":
    main()
which results in the following output on an H100:
04-07 11:30:42 I torch.version 2.0.0+cu118
04-07 11:30:42 I torch.version.cuda 11.8
04-07 11:30:42 I device name NVIDIA H100 PCIe
04-07 11:30:42 I device capability (9, 0)
04-07 11:30:42 I device properties _CudaDeviceProperties(name='NVIDIA H100 PCIe', major=9, minor=0, total_memory=81075MB, multi_processor_count=114)
04-07 11:30:42 I qkv
04-07 11:30:44 I scaled_dot_product_attention
04-07 11:30:44 I backward
Traceback (most recent call last):
File ".../scripts/setup_native_flash_attn.py", line 49, in <module>
main()
File ".../scripts/setup_native_flash_attn.py", line 44, in main
x.mean().backward()
File ".../lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File ".../lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
and the following output on an A100:
04-07 11:33:38 I torch.version 2.0.0+cu118
04-07 11:33:38 I torch.version.cuda 11.8
04-07 11:33:38 I device name NVIDIA A100-PCIE-40GB
04-07 11:33:38 I device capability (8, 0)
04-07 11:33:38 I device properties _CudaDeviceProperties(name='NVIDIA A100-PCIE-40GB', major=8, minor=0, total_memory=40384MB, multi_processor_count=108)
04-07 11:33:39 I qkv
04-07 11:33:40 I scaled_dot_product_attention
04-07 11:33:40 I backward
04-07 11:33:40 I fin
- The same thing happens without mixed precision.
- Setting CUDA_LAUNCH_BLOCKING=1 doesn’t help either; the error stays the same.
Is the H100 not supported yet, or am I missing something here?
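In case it helps narrow this down: if I’m reading the 2.0 docs right, torch.backends.cuda.sdp_kernel lets you restrict scaled_dot_product_attention to a single backend (math, flash, or memory-efficient). Below is a minimal sketch of how I’d isolate which kernel triggers the illegal instruction; the shapes mirror the ViT case above, the inputs are created directly in bfloat16 instead of going through autocast, and each backend is meant to be tested in a separate run, since the CUDA context is unusable after the error.

import sys

import torch
import torch.nn.functional as F
from torch.backends.cuda import sdp_kernel


def main():
    # pick exactly one backend per run: "math", "flash" or "mem_efficient"
    backend = sys.argv[1] if len(sys.argv) > 1 else "math"
    flags = dict(enable_math=False, enable_flash=False, enable_mem_efficient=False)
    flags[f"enable_{backend}"] = True

    device = torch.device("cuda:0")
    # same shapes as above: batch 4, 16 heads, 197 tokens, head dim 768 // 16 = 48
    q = torch.randn(4, 16, 197, 48, device=device, dtype=torch.bfloat16, requires_grad=True)
    k = torch.randn(4, 16, 197, 48, device=device, dtype=torch.bfloat16, requires_grad=True)
    v = torch.randn(4, 16, 197, 48, device=device, dtype=torch.bfloat16, requires_grad=True)

    # restrict SDPA to the selected backend, then run forward + backward
    with sdp_kernel(**flags):
        x = F.scaled_dot_product_attention(q, k, v)
    x.mean().backward()
    print(f"{backend}: forward/backward ok")


if __name__ == "__main__":
    main()

If only the flash (or memory-efficient) run crashes on the H100 while the math run succeeds, that would point at the fused kernel rather than at autograd itself, and forcing enable_math=True might serve as a temporary workaround. Another quick check for the support question is torch.cuda.get_arch_list(), which lists the compute capabilities the installed wheel was built for.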