Hi all - I started experiencing this issue recently on a new set of A100 GPUs. I am using the following configuration:
NVIDIA A100 GPU
torch == 1.4.0
Python == 3.7.4-GCCcore-8.3.0
CUDA == 10.1.243-GCC-8.3.0
cudnn == 7.6.5-10.1
The error occurs when I call forward() on a sequence of convolutional operations in my code.
I have used the same configuration on other GPUs, such as the K80, P100, V100 and RTX 8000, and never hit this error. Has anybody found a solution to this issue?
Just updated to the latest torch 1.8.1 and now the error has changed to:
A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the A100-PCIE-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
I had a look at the link provided, but I am not sure how to update or change my PyTorch installation to allow sm_80.
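For anyone puzzled by the warning itself: it fires because the device's compute capability (8.0 for the A100) is not covered by any architecture the wheel was compiled for. A minimal sketch of that check, written from the warning text alone (the function names and the cu111 arch list below are my own assumptions, not PyTorch internals, and PTX JIT forward compatibility is ignored):

```python
def parse_arch(arch):
    """Convert an 'sm_XX' string to a (major, minor) capability tuple."""
    digits = arch.split("_")[1]
    return int(digits[0]), int(digits[1:])

def is_supported(device_cap, compiled_archs):
    """A cubin built for sm_XY runs on devices with the same major version X
    and minor version >= Y, so the device needs at least one compiled arch
    that matches its major version without exceeding its minor version."""
    dev_major, dev_minor = device_cap
    return any(
        major == dev_major and minor <= dev_minor
        for major, minor in map(parse_arch, compiled_archs)
    )

# Arch list quoted in the error message above:
old_wheel = ["sm_37", "sm_50", "sm_60", "sm_70"]
# Assumed arch list for a cu111 wheel (illustrative only):
cu111_wheel = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75", "sm_80"]

print(is_supported((8, 0), old_wheel))    # False: no sm_8x arch present
print(is_supported((8, 0), cu111_wheel))  # True: sm_80 covers the A100
```

On a live install you can see the real list with `torch.cuda.get_arch_list()`.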
I think I fixed my own problem, but posting the answer here in case anybody ever comes across this later down the line.
The first issue comes from the fact that (at least in my case) the A100 GPUs come with sm_80 and a pre-installed CUDA 11+ driver. I was sabotaging myself by module-loading CUDA 10.1 and cuDNN 7.6.5 on the cluster.
The solution was to install the CUDA 11.1 build of PyTorch from the Start Locally | PyTorch page:
pip3 install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
Then, I just stopped loading CUDA 10.1 and cuDNN 7.6.5 as separate modules, which made all the errors go away. I am running a few test jobs now and will post updates if anything else breaks.
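Putting the whole fix in one place, this is roughly the sequence (the `module` names are specific to my cluster, so treat those lines as assumptions for your environment):

```shell
# Stop loading the old toolkit modules (module names are site-specific).
module unload CUDA cuDNN

# Install the CUDA 11.1 build of PyTorch 1.8.1.
pip3 install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 \
    -f https://download.pytorch.org/whl/torch_stable.html

# Verify: the printed arch list should now include sm_80.
python3 -c "import torch; print(torch.version.cuda, torch.cuda.get_arch_list())"
```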
Thanks, Andrei; just wanted to let you know that this saved me today!
Did you build a public docker for this? If yes, could you please share it with us? Thanks.
Hi @akashs. There is no need for one, as this is an installation fix; just follow the steps in the solution above.