Experiencing performance regression on 3090

We are experiencing a performance regression on the RTX 3090 with PyTorch; many people in my lab have run into the same issue. The accuracy of models trained on the RTX 3090 is usually 0.5~1% lower than that of models trained on the RTX 2080 Ti.

When training models with mmdetection using DDP, I also notice that DDP gives a lower speedup on the RTX 3090 than on the RTX 2080 Ti.

Wondering if anyone could help us out, thanks a lot.

Here is our environment:

Collecting environment information...
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: GeForce RTX 3090
GPU 1: GeForce RTX 3090

Nvidia driver version: 455.32.00
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.7.1
[pip3] torchaudio==0.7.0a0+a853dff
[pip3] torchvision==0.8.2
[conda] blas                      1.0                         mkl    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] cudatoolkit               11.0.221             h6bb024c_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl                       2020.2                      256    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl-service               2.3.0            py38he904b0f_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl_fft                   1.2.0            py38h23d657b_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl_random                1.1.1            py38h0573a6f_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] numpy                     1.19.2           py38h54aff64_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] numpy-base                1.19.2           py38hfa32c7d_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] pytorch                   1.7.1           py3.8_cuda11.0.221_cudnn8.0.5_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
[conda] torchaudio                0.7.2                      py38    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
[conda] torchvision               0.8.2                py38_cu110    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch

The performance regression might come from other libraries such as cuDNN; the 1.7.1 binary is using cuDNN 8.0.5. Could you post the model architecture and all necessary input shapes, so that we can check which kernels are called?

Could you disable TF32 via torch.backends.cuda.matmul.allow_tf32 = False and torch.backends.cudnn.allow_tf32 = False and check the accuracy for a new run?
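
For example, a minimal sketch of where these flags could go, at the top of the training script before the model and data loaders are created:

import torch

# Disable TF32 for matmuls and cuDNN convolutions so that Ampere GPUs
# compute these ops in full FP32, matching pre-Ampere numerical behavior.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False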

Disabling TF32 seems likely to be the solution, since FP16 training also causes a performance regression in my task. I will investigate this further and update my findings below once finished.

Thanks a lot!

Could you post the complete training setup and the mean +/- stddev of the final accuracy you are seeing, so that we can try to reproduce it?

According to my experiment, disabling TF32 is not the solution.

For easier reproducibility and more efficient experiments, I trained several models using deep-person-reid (commit 57dad6).

I simply execute python scripts/main.py --config-file configs/im_osnet_x1_0_softmax_256x128_amsgrad_cosine.yaml --transforms random_flip random_erase --root $PATH_TO_DATA as described in the readme.md. For each setting, the experiment was repeated 5 times.
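
The mean and standard deviation reported in the table below are computed over these 5 runs; a minimal sketch of the aggregation (the mAP values here are placeholders, not actual results):

import numpy as np

# mAP from 5 repeated runs of one setting (placeholder values).
map_runs = np.array([85.5, 85.7, 85.6, 85.8, 85.5])

print(f"mAP: {map_runs.mean():.2f} [{map_runs.std():.2f}]")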

The driver, torch, CUDA, etc. versions are kept the same on both machines:

Collecting environment information...
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti

Nvidia driver version: 460.27.04
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.7.1
[pip3] torchaudio==0.7.0a0+a853dff
[pip3] torchreid==1.3.3
[pip3] torchvision==0.8.2
[conda] blas                      1.0                         mkl    defaults
[conda] cudatoolkit               11.0.221             h6bb024c_0    defaults
[conda] mkl                       2020.2                      256    defaults
[conda] mkl-service               2.3.0            py38he904b0f_0    defaults
[conda] mkl_fft                   1.2.0            py38h23d657b_0    defaults
[conda] mkl_random                1.1.1            py38h0573a6f_0    defaults
[conda] numpy                     1.19.2           py38h54aff64_0    defaults
[conda] numpy-base                1.19.2           py38hfa32c7d_0    defaults
[conda] pytorch                   1.7.1           py3.8_cuda11.0.221_cudnn8.0.5_0    pytorch
[conda] torchaudio                0.7.2                      py38    pytorch
[conda] torchreid                 1.3.3                     dev_0    <develop>
[conda] torchvision               0.8.2                py38_cu110    pytorch

PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.10.2

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: GeForce RTX 3090
GPU 1: GeForce RTX 3090

Nvidia driver version: 460.27.04
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.7.1
[pip3] torchaudio==0.7.0a0+a853dff
[pip3] torchreid==1.3.3
[pip3] torchvision==0.8.2
[conda] blas                      1.0                         mkl    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] cudatoolkit               11.0.221             h6bb024c_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl                       2020.2                      256    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl-service               2.3.0            py38he904b0f_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl_fft                   1.2.0            py38h23d657b_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl_random                1.1.1            py38h0573a6f_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] numpy                     1.19.2           py38h54aff64_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] numpy-base                1.19.2           py38hfa32c7d_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] pytorch                   1.7.1           py3.8_cuda11.0.221_cudnn8.0.5_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
[conda] torchaudio                0.7.2                      py38    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
[conda] torchreid                 1.3.3                     dev_0    <develop>
[conda] torchvision               0.8.2                py38_cu110    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch

I made the following changes to disable TF32:

index 61aa49d..2758308 100755
--- a/scripts/main.py
+++ b/scripts/main.py
@@ -188,4 +188,6 @@ def main():
 
 
 if __name__ == '__main__':
+    torch.backends.cuda.matmul.allow_tf32 = False
+    torch.backends.cudnn.allow_tf32 = False
     main()

The regression is marginal because the model is simple and the task is easy, but it does exist. For the more complex network structures and tasks I am working on, the regression is more evident.

GPU               mAP (%) [stddev]    Rank-1 (%) [stddev]
3090              85.62 [0.15]        94.52 [0.27]
3090, TF32=False  85.56 [0.13]        94.58 [0.16]
2080Ti            86.74 [0.19]        94.70 [0.24]

Also, I found that someone else has encountered a similar issue. Maybe the performance regression does come from other libraries.

Thanks for the experiment stats. Based on the mean and std, it doesn't seem that TF32 or the GPU is changing the final results significantly in the 5 runs.
To further isolate the difference, you could try to make the runs deterministic (e.g. to remove the influence of shuffling the dataset) and also make sure to use the same library stack (same PyTorch, CUDA, cuDNN, etc. versions).
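
As a rough sketch of making the runs deterministic (seed_everything is just a helper written here, and this does not cover every source of randomness, e.g. DataLoader workers may need their own seeding):

import random

import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    # Seed Python, NumPy, and PyTorch (CPU and all GPUs).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic algorithms and disable the benchmark
    # autotuner, which can otherwise select different kernels per run.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(0)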