FSDP tutorial: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm`

Hi there,

I’m getting the following error when running the code (copied exactly) from the FSDP tutorial:

RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I read in other threads that this can be caused by an input size that doesn't match the layer size, but I have checked this. I have also tried reducing the batch size to 1. The error still persists. Any ideas?
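
For reference, a bare float32 matrix multiply on the GPU is also handled by cuBLAS, so I assume a minimal check like the one below (arbitrary shapes, nothing from the tutorial code) should succeed if the problem were only a shape mismatch in my model:

    import torch

    # Single-GPU sanity check, independent of FSDP: a plain float32 matmul
    # is dispatched to cuBLAS under the hood.
    a = torch.randn(128, 256, device="cuda")   # arbitrary shapes, inner dims match
    b = torch.randn(256, 64, device="cuda")
    c = a @ b
    torch.cuda.synchronize()   # surface any asynchronous CUDA error here
    print(c.shape)             # expected: torch.Size([128, 64])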

> python -m torch.utils.collect_env          
Collecting environment information...
PyTorch version: 1.13.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux Server release 7.9 (Maipo) (x86_64)
GCC version: (GCC) 10.2.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.8.6 (default, Mar 29 2021, 14:28:48)  [GCC 10.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.66.1.el7.x86_64-x86_64-with-glibc2.2.5
Is CUDA available: True
CUDA runtime version: 11.6.55
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100 80GB PCIe
GPU 1: NVIDIA A100 80GB PCIe

Nvidia driver version: 515.65.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.3
[pip3] pytorch-lightning==1.8.3
[pip3] torch==1.13.0
[pip3] torchio==0.18.84
[pip3] torchmetrics==0.9.3
[pip3] torchvision==0.14.0
[conda] Could not collect

The issue was a mismatch between the locally installed CUDA runtime (v11.6) and the CUDA version used to build PyTorch (v11.7). Upgrading the local CUDA installation to v11.7 fixed the problem.
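
For anyone who lands here with the same error: a quick way to spot this kind of mismatch is to compare the CUDA version PyTorch was built against with the toolkit installed locally, e.g. (rough sketch):

    import torch

    # CUDA version PyTorch was built against (11.7 for torch==1.13.0+cu117)
    print("built with CUDA:", torch.version.cuda)
    print("CUDA available:", torch.cuda.is_available())
    print("device:", torch.cuda.get_device_name(0))

Then compare against `nvcc --version` for the local toolkit. Note that `nvidia-smi` only reports the driver's maximum supported CUDA version, not the installed toolkit; the `collect_env` output above already shows both values ("CUDA used to build PyTorch" vs. "CUDA runtime version").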