We are experiencing a performance regression on the RTX 3090 with PyTorch, and many people in my lab have hit the same issue. The accuracy of models trained on an RTX 3090 is usually 0.5~1% lower than that of models trained on an RTX 2080 Ti.
When training models with mmdetection using DDP, I also notice that DDP yields a lower speedup on the RTX 3090 than on the RTX 2080 Ti.
Wondering if anyone could help us out, thanks a lot.
Here is our environment:
Collecting environment information...
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 3090
GPU 1: GeForce RTX 3090
Nvidia driver version: 455.32.00
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.7.1
[pip3] torchaudio==0.7.0a0+a853dff
[pip3] torchvision==0.8.2
[conda] blas 1.0 mkl https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] cudatoolkit 11.0.221 h6bb024c_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl 2020.2 256 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl-service 2.3.0 py38he904b0f_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl_fft 1.2.0 py38h23d657b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl_random 1.1.1 py38h0573a6f_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] numpy 1.19.2 py38h54aff64_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] numpy-base 1.19.2 py38hfa32c7d_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] pytorch 1.7.1 py3.8_cuda11.0.221_cudnn8.0.5_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
[conda] torchaudio 0.7.2 py38 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
[conda] torchvision 0.8.2 py38_cu110 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
The performance regression might come from other libraries such as cudnn, and the 1.7.1 binary ships with cudnn 8.0.5. Could you post the model architecture and all necessary input shapes, so that we can check which kernels are called?
Could you disable TF32 via `torch.backends.cuda.matmul.allow_tf32 = False` and `torch.backends.cudnn.allow_tf32 = False` and check the accuracy for a new run?
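For convenience, here is a minimal sketch of that toggle. The helper name is my own, and the import guard is only there so the snippet runs anywhere; the two flags themselves are available from PyTorch 1.7 onwards:

```python
def set_tf32(enabled: bool) -> bool:
    """Toggle TF32 for CUDA matmuls and cuDNN convolutions.

    Returns True if the flags were applied, False if torch is not
    importable (the guard is only here so the sketch runs anywhere).
    """
    try:
        import torch
    except ImportError:
        return False
    # Both flags default to True in PyTorch 1.7, so on Ampere GPUs
    # matmuls/convs silently use TF32 unless they are switched off.
    torch.backends.cuda.matmul.allow_tf32 = enabled
    torch.backends.cudnn.allow_tf32 = enabled
    return True

# Call once at startup, before building the model:
# set_tf32(False)
```

Calling it before model construction ensures every matmul and convolution in the run uses full FP32.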
Disabling TF32 seems to be the solution, since FP16 training in my task also brings a performance regression. I will investigate this issue further and post my findings below once finished.
According to my experiment, disabling TF32 is not the solution.
For easier reproducibility and more efficient experiments, I trained several models using deep-person-reid at commit 57dad6.
I simply ran `python scripts/main.py --config-file configs/im_osnet_x1_0_softmax_256x128_amsgrad_cosine.yaml --transforms random_flip random_erase --root $PATH_TO_DATA` as described in the README. Each setting was repeated 5 times.
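For reference, the repeated runs can be scripted as below. This is only a sketch of my procedure: it assumes the deep-person-reid repo checked out at commit 57dad6 and `$PATH_TO_DATA` pointing at a prepared dataset root.

```shell
# Repeat the training 5 times for one setting.
for i in 1 2 3 4 5; do
  python scripts/main.py \
    --config-file configs/im_osnet_x1_0_softmax_256x128_amsgrad_cosine.yaml \
    --transforms random_flip random_erase \
    --root "$PATH_TO_DATA"
done
```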
The driver, torch, CUDA, etc. versions were kept the same across both machines:
Collecting environment information...
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
Nvidia driver version: 460.27.04
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.7.1
[pip3] torchaudio==0.7.0a0+a853dff
[pip3] torchreid==1.3.3
[pip3] torchvision==0.8.2
[conda] blas 1.0 mkl defaults
[conda] cudatoolkit 11.0.221 h6bb024c_0 defaults
[conda] mkl 2020.2 256 defaults
[conda] mkl-service 2.3.0 py38he904b0f_0 defaults
[conda] mkl_fft 1.2.0 py38h23d657b_0 defaults
[conda] mkl_random 1.1.1 py38h0573a6f_0 defaults
[conda] numpy 1.19.2 py38h54aff64_0 defaults
[conda] numpy-base 1.19.2 py38hfa32c7d_0 defaults
[conda] pytorch 1.7.1 py3.8_cuda11.0.221_cudnn8.0.5_0 pytorch
[conda] torchaudio 0.7.2 py38 pytorch
[conda] torchreid 1.3.3 dev_0 <develop>
[conda] torchvision 0.8.2 py38_cu110 pytorch
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.10.2
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 3090
GPU 1: GeForce RTX 3090
Nvidia driver version: 460.27.04
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.7.1
[pip3] torchaudio==0.7.0a0+a853dff
[pip3] torchreid==1.3.3
[pip3] torchvision==0.8.2
[conda] blas 1.0 mkl https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] cudatoolkit 11.0.221 h6bb024c_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl 2020.2 256 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl-service 2.3.0 py38he904b0f_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl_fft 1.2.0 py38h23d657b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl_random 1.1.1 py38h0573a6f_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] numpy 1.19.2 py38h54aff64_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] numpy-base 1.19.2 py38hfa32c7d_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] pytorch 1.7.1 py3.8_cuda11.0.221_cudnn8.0.5_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
[conda] torchaudio 0.7.2 py38 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
[conda] torchreid 1.3.3 dev_0 <develop>
[conda] torchvision 0.8.2 py38_cu110 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
The regression is marginal because the model is simple and the task is easy, but it does exist. For the more complex network structures and tasks I am working on, the regression is more evident.
| GPU | mAP [stdev] (%) | Rank-1 [stdev] (%) |
| --- | --- | --- |
| 3090 | 85.62 [0.15] | 94.52 [0.27] |
| 3090, TF32=False | 85.56 [0.13] | 94.58 [0.16] |
| 2080Ti | 86.74 [0.19] | 94.70 [0.24] |
Also, I found that someone else has encountered a similar issue. Maybe the performance regression did come from another library.
Thanks for the experiment stats. Based on the mean and std, it doesn’t seem that TF32 or the GPU changes the final results significantly across the 5 runs.
To further isolate the difference, you could try to make the runs deterministic (e.g. to remove the influence of dataset shuffling) and also make sure to use the same library stack (same PyTorch, CUDA, cudnn, etc. versions) on both machines.
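A minimal seeding/determinism sketch is below; the helper name is my own, and you would call it once at the top of the training script. Note that full determinism may also require disabling nondeterministic ops in your model, which this sketch does not cover:

```python
import random


def seed_everything(seed: int = 0) -> bool:
    """Seed Python, NumPy, and (if available) PyTorch RNGs and
    request deterministic cuDNN behavior.

    Returns True if the torch-side settings were applied, False if
    torch is not importable (guard only so the sketch runs anywhere).
    """
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
    except ImportError:
        return False
    torch.manual_seed(seed)           # seeds CPU and, via this call,
    torch.cuda.manual_seed_all(seed)  # all CUDA devices
    torch.backends.cudnn.deterministic = True  # pick deterministic kernels
    torch.backends.cudnn.benchmark = False     # disable the autotuner
    return True
```

With identical seeds, data order, and library versions on both machines, any remaining accuracy gap would point at the GPU/kernel selection itself.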