When I run PyTorch on my newly built RTX 3080 Ti machine, training always fails with either “segmentation fault” or “CUDA error: unspecified launch failure”.
My hardware configuration is i5-11600KF + 32 GB RAM + RTX 3080 Ti.
The software configuration is Anaconda3 + Python 3.8.11 + PyTorch 1.8.1 + CUDA 11.1 + cuDNN 8.0.5 + NVIDIA driver 460.
I’ve also tried PyTorch 1.9.0 and CUDA 11.2, with the same result. With every version I tried, training starts and the loss drops normally, but an error occurs in less than 10 minutes. The model is mainly built from LSTM layers, and the same code runs without problems on a Tesla V100 (tesla-v100-12g).
I guess you’ve built PyTorch from source using this CUDA version (11.2), or did you use a pip wheel from another source? The official binaries ship with either CUDA 10.2 (which won’t work on your 3080 Ti) or CUDA 11.1.
In any case, could you post a minimal, executable code snippet which we could use to reproduce this issue, please?
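To help answer the question about which CUDA build is actually installed, something like the following could be run in the same environment. It is only a sketch: `build_supports_device` and `report` are hypothetical helper names, and the compatibility check is a simplification that looks for a compiled `sm_XY` entry with the same major version as the GPU and a minor version no higher than the GPU's (PTX-only `compute_XY` entries are ignored).

```python
from typing import List, Tuple


def build_supports_device(arch_list: List[str], capability: Tuple[int, int]) -> bool:
    """Simplified check (an assumption, not PyTorch's own logic): is there a
    compiled sm_XY with the same major version and a minor version <= the GPU's?"""
    major, minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # skip PTX-only entries such as "compute_80"
        digits = arch[len("sm_"):]
        if not digits.isdigit():
            continue  # skip entries this sketch does not understand
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        if arch_major == major and arch_minor <= minor:
            return True
    return False


def report() -> None:
    """Print the versions the installed PyTorch binary was built with."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    print("PyTorch:", torch.__version__)
    print("CUDA used for the build:", torch.version.cuda)
    print("cuDNN:", torch.backends.cudnn.version())
    if torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
        arch_list = torch.cuda.get_arch_list()
        print("GPU:", torch.cuda.get_device_name(0), "capability", capability)
        print("Compiled architectures:", arch_list)
        print("Build covers this GPU:", build_supports_device(arch_list, capability))
    else:
        print("CUDA is not available to PyTorch")


if __name__ == "__main__":
    report()
```

An RTX 3080 Ti reports compute capability (8, 6), so a build whose architecture list stops at `sm_75` would not pass this check, while one listing `sm_80` or `sm_86` would.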
I may have found the problem.
After repeated testing: when the GPU temperature stays above 65 ℃ for a long time, training hits errors and is interrupted. If the temperature is kept below 50 ℃, training runs normally.
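For anyone who wants to check whether their crashes follow the same pattern, a rough way to log GPU temperature next to the training loss is sketched below. It assumes `nvidia-smi` is on the PATH; the helper names (`parse_temperatures`, `read_gpu_temperatures`, `too_hot`) are made up for illustration, and the 65 ℃ default is simply the value observed above, not a documented limit.

```python
import subprocess
from typing import List


def parse_temperatures(smi_output: str) -> List[int]:
    """Turn nvidia-smi's one-number-per-line output into a list of ints.
    Lines that are not plain integers (e.g. "[N/A]") are skipped."""
    temps = []
    for line in smi_output.splitlines():
        value = line.strip()
        if value.isdigit():
            temps.append(int(value))
    return temps


def read_gpu_temperatures() -> List[int]:
    """Query the current temperature (in degrees Celsius) of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperatures(result.stdout)


def too_hot(temps: List[int], limit_c: int = 65) -> bool:
    """True if any GPU is above the limit (65 is the value from this thread)."""
    return any(t > limit_c for t in temps)
```

Calling `read_gpu_temperatures()` every few seconds during training and writing the result to the training log makes it easy to see whether the failure lines up with the temperature. This only helps confirm the correlation; it does not say why the card becomes unstable at that temperature.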