Training multiple models on multiple GPUs hangs


When I run multiple train sessions on multiple GPUs (one model per GPU), I am getting repeatable problems on one GPU (GPU 3). Note that this GPU is the only one configured for video output as well.

a.) If the first training session runs on the affected GPU 3, training hangs as soon as I start two or more sessions on the other GPUs. (GPU 3 is unavailable afterward; a reboot is required.)
b.) If GPU 3 is the last one to start training, the entire system freezes (hard reboot required).

If I train only on the affected GPU 3, it runs without any problems.
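For context, each session is an independent process pinned to a single GPU via `CUDA_VISIBLE_DEVICES` (no DDP involved). A rough sketch of how I launch them, where `train.py` stands in for my actual training script:

```python
import os
import subprocess
import sys

def per_gpu_envs(gpu_ids):
    """One environment per session; CUDA_VISIBLE_DEVICES pins each
    process to a single device, which it then sees as cuda:0."""
    return [dict(os.environ, CUDA_VISIBLE_DEVICES=str(g)) for g in gpu_ids]

def launch_per_gpu(script, gpu_ids):
    """Start one independent training process per GPU."""
    return [subprocess.Popen([sys.executable, script], env=env)
            for env in per_gpu_envs(gpu_ids)]

# e.g. launch_per_gpu("train.py", [0, 1, 2, 3])
```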

PyTorch version: 1.7.1
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Enterprise

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 11.0
GPU models and configuration: 4 × RTX 3090
Nvidia driver version: 461.40
cuDNN version: 8.0.4

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.7.1
[pip3] torchvision==0.8.2
[conda] blas 1.0 mkl conda-forge
[conda] cudatoolkit 11.0.221 h74a9793_0 anaconda
[conda] mkl 2020.2 256 anaconda
[conda] mkl-service 2.3.0 py38hb782905_0
[conda] mkl_fft 1.2.0 py38h45dec08_0
[conda] mkl_random 1.1.1 py38h47e9c7a_0 anaconda
[conda] numpy 1.19.2 py38hadc3359_0
[conda] numpy-base 1.19.2 py38ha3acd2a_0
[conda] pytorch 1.7.1 py3.8_cuda110_cudnn8_0 pytorch
[conda] torchvision 0.8.2 py38_cu110 pytorch

CPU: AMD Ryzen Threadripper 3970X

I have already tried the solutions suggested here: Repeatable system freezes under GPU load with Threadripper & Ubuntu 18.04 - GPU - Level1Techs Forums

Current workaround: I found that if I use smaller batch sizes and set num_workers of the training DataLoaders to a nonzero value, the problem does not occur.
However, I would still like to find the root cause.
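For illustration, this is roughly how the working DataLoader configuration looks (the dataset here is a dummy stand-in, and the batch size and worker count are example values, not my exact ones):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(batch_size=16, num_workers=2):
    """Build a training DataLoader with the workaround settings:
    a reduced batch size and nonzero num_workers."""
    # Dummy tensors standing in for the real training set.
    data = torch.randn(256, 3, 64, 64)
    labels = torch.randint(0, 10, (256,))
    dataset = TensorDataset(data, labels)
    # num_workers > 0 moves batch loading into worker processes;
    # pin_memory speeds up host-to-GPU copies.
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=num_workers, pin_memory=True)
```

Note that on Windows, iterating a DataLoader with num_workers > 0 must happen under an `if __name__ == "__main__":` guard, since workers are spawned as separate processes.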

Edit: a simple example in TensorFlow runs without any problems on all GPUs.