GPU 0 gets duplicate processes from processes running on other GPUs

I am training different models on different GPUs simultaneously in one of my remote machines, and I found that the processes running on GPUs whose id isn’t 0 are somehow duplicated on GPU whose id is 0, as shown below:

As you can see from the PIDs, the processes running on GPU_i with i nonzero are also duplicated on GPU_0. I am running the exact same code in each of these GPUs (with different hyperparameters). This is really puzzling because in another remote machine with very similar settings, I could not reproduce this behavior even though I ran exactly the same code in each GPU in exactly the same way. In that machine, there was no duplication of processes to GPU_0, only one process in each GPU.

So, I concluded that this is likely not due to a bug in my code, but due to external settings like Pytorch, cuda, or driver version. The other machine I tried to reproduce this behavior has the exactly the same version of driver (387.26) and same number of the same GPU (Titan-XP), but they have somewhat different versions of pytorch and cuda. Shown below is the exact versions of the packages installed (all installed by conda) in each machine (the first one is from the machine where I had this problem, and the second from the other machine):
(The problematic machine)

certifi                   2016.2.28                py36_0  
cffi                      1.10.0                   py36_0  
cudatoolkit               8.0                           3    anaconda
cudnn                     6.0.21                cuda8.0_0  
cycler                    0.10.0                   py36_0  
cycler                    0.10.0                    <pip>
dbus                      1.10.20                       0  
decorator                 4.1.2                    py36_0  
expat                     2.1.0                         0  
fontconfig                2.12.1                        3  
freetype                  2.5.5                         2  
glib                      2.50.2                        1  
gst-plugins-base          1.8.0                         0  
gstreamer                 1.8.0                         0  
icu                       54.1                          0  
imageio                   2.3.0                    py36_0    conda-forge
intel-openmp              2018.0.0             hc7b2577_8  
jbig                      2.1                           0  
jpeg                      9b                            0  
libffi                    3.2.1                         1  
libgcc                    5.2.0                         0  
libgcc-ng                 7.2.0                h7cc24e2_2  
libgfortran               3.0.0                         1  
libgfortran-ng            7.2.0                h9f7466a_2  
libiconv                  1.14                          0  
libpng                    1.6.30                        1  
libstdcxx-ng              7.2.0                h7a57d05_2  
libtiff                   4.0.6                         3  
libxcb                    1.12                          1  
libxml2                   2.9.4                         0  
matplotlib                2.0.2                     <pip>
matplotlib                2.0.2               np113py36_0  
mkl                       2018.0.1             h19d6760_4  
nccl                      1.3.4                 cuda8.0_1  
networkx                  1.11                     py36_0  
numpy                     1.13.3           py36ha12f23b_0  
olefile                   0.44                     py36_0  
openssl                   1.0.2l                        0  
pcre                      8.39                          1  
pillow                    4.2.1                    py36_0  
pip                       9.0.1                    py36_1  
pycparser                 2.18                     py36_0  
pyparsing                 2.2.0                    py36_0  
pyparsing                 2.2.0                     <pip>
pyqt                      5.6.0                    py36_2  
python                    3.6.2                         0  
python-dateutil           2.6.1                     <pip>
python-dateutil           2.6.1                    py36_0  
pytorch                   0.3.0           py36_cuda8.0.61_cudnn7.0.3h37a80b5_4    pytorch
pytz                      2017.2                   py36_0  
pytz                      2017.2                    <pip>
pywavelets                0.5.2               np113py36_0  
qt                        5.6.2                         5  
readline                  6.2                           2  
scikit-image              0.13.0              np113py36_0  
scipy                     1.0.0            py36hbf646e7_0  
setuptools                36.4.0                   py36_0  
sip                       4.18                     py36_0  
six                       1.10.0                   py36_0  
sqlite                    3.13.0                        0  
tk                        8.5.18                        0  
torchvision               0.2.0            py36h17b6947_1    pytorch
wheel                     0.29.0                   py36_0  
xz                        5.2.3                         0  
zlib                      1.2.11                        0  

(The other machine)

ca-certificates           2018.03.07                    0  
certifi                   2018.4.16                py36_0  
cffi                      1.11.5           py36h9745a5d_0  
cuda91                    1.0                  h4c16780_0    pytorch
cycler                    0.10.0           py36h93f1223_0  
dbus                      1.13.2               h714fa37_1  
expat                     2.2.5                he0dffb1_0  
fontconfig                2.12.6               h49f89f6_0  
freetype                  2.8                  hab7d2ae_1  
glib                      2.56.1               h000015b_0  
gst-plugins-base          1.14.0               hbbd80ab_1  
gstreamer                 1.14.0               hb453b48_1  
icu                       58.2                 h9c2bf20_1  
imageio                   2.3.0                    py36_0  
intel-openmp              2018.0.0                      8  
jpeg                      9b                   h024ee3a_2  
kiwisolver                1.0.1            py36h764f252_0  
libedit                   3.1                  heed3624_0  
libffi                    3.2.1                hd88cf55_4  
libgcc-ng                 7.2.0                hdf63c60_3  
libgfortran-ng            7.2.0                hdf63c60_3  
libpng                    1.6.34               hb9fc6fc_0  
libstdcxx-ng              7.2.0                hdf63c60_3  
libtiff                   4.0.9                h28f6b97_0  
libxcb                    1.13                 h1bed415_1  
libxml2                   2.9.8                hf84eae3_0  
matplotlib                2.2.2            py36h0e671d2_1  
mkl                       2018.0.2                      1  
mkl_fft                   1.0.1            py36h3010b51_0  
mkl_random                1.0.1            py36h629b387_0  
ncurses                   6.0                  h9df7e31_2  
numpy                     1.14.2           py36hdbf6ddf_1  
olefile                   0.45.1                   py36_0  
openssl                   1.0.2o               h20670df_0  
pcre                      8.42                 h439df22_0  
pillow                    5.1.0            py36h3deb7b8_0  
pip                       9.0.3                    py36_0  
pycparser                 2.18             py36hf9f622e_1  
pyparsing                 2.2.0            py36hee85983_1  
pyqt                      5.9.2            py36h751905a_0  
python                    3.6.4                hc3d631a_3  
python-dateutil           2.7.2                    py36_0  
pytorch                   0.3.1           py36_cuda9.1.85_cudnn7.0.5_2  [cuda91]  pytorch
pytz                      2018.4                   py36_0  
qt                        5.9.5                h7e424d6_0  
readline                  7.0                  ha6073c6_4  
scipy                     1.0.1            py36hfc37229_0  
setuptools                39.0.1                   py36_0  
sip                       4.19.8           py36hf484d3e_0  
six                       1.11.0           py36h372c433_1  
sqlite                    3.23.1               he433501_0  
tk                        8.6.7                hc745277_3  
torchvision               0.2.0            py36h17b6947_1    pytorch
tornado                   5.0.2                    py36_0  
wheel                     0.31.0                   py36_0  
xz                        5.2.3                h55aa19d_2  
zlib                      1.2.11               ha838bed_2  

As you can see, their Pytorch and cuda versions are different. Can this be the cause of this weird behavior?
(By the way, not sure if this is relevant information, but the cuda version I get from nvcc --version in both machines is V7.5.17, possibly due to remnants of the previous installations)

How did you run the different processes?
Did you use CUDA_VISIBLE_DEVICES=ID in your environment or did you push the tensors to the appropriate GPU in the different codes?

If you’ve used the second approach, PyTorch still sees all GPUs and uses GPU0 by default to initialize CUDA as far as I know.

Say my code is I create a tmux session and 8 panes in it. In i-th panel, I run 'python --gpu_idx i" where i starts from 0 to 7 and --gpu_idx indicates the device id of the gpu I want to use. At the very beginning of the code there is torch.cuda.set_device(args.gpu_idx) to set the gpu to use. Then, I apply .cuda() to the network and loss function, and in the training loop, I wrap the data and labels tensors as Variables, and do data = data.cuda() and labels = labels.cuda(), in a very standard way. I never used CUDA_VISIBLE_DEVICES=ID. So, I think I used the second approach you are mentioning. But, what is really strange is that in the other machine, the very same code works just as expected, meaning that running --gpu_idx i only creates a process in the corresponding gpu, and not in GPU0.

Ok, I see.
Could you try to use


and see if this problem still occurs.

Also, since you are using 0.3.x, I would suggest to update to the latest stable release 0.4.0.
It has some nice features and bug fixes. Have a look at the Migration Guide.

Unfortunately, I cannot test your script on my machine at the moment, since it’s busy.

I think there was a bug in pytorch 0.3 that was fixed in 0.3.1 where if the dataloader was using pinned memory it created context on a 0-th gpu even if 0-th gpu was not used. Probably that’s what you are seeing. As @ptrblck is suggesting, try updating to pytorch 0.4, this is the latest stable version.

1 Like

Based on the fact that the exact same code doesn’t have such a problem in pytorch 0.3.1. and I indeed used pinned memory, I think ngimel’s answer is probably right, although I haven’t tried to install 0.3.1 and revisit the problem. I will probably install 0.4 later. Thank you so much guys!

For future reference, I confirm that upgrading to 0.3.1 solved the problem.

I am using PyTorch v1.0.0 and seem to be running into the same issue as described in the original post. Duplicate processes are started on GPU 0 from processes running on the other GPUs. See screenshot for an example.

Hi, I am facing the same issue. Did you solve this problem?

Hi @ptrblck, I’m still facing this issue now with DDP being set up by default. Any update on what might have caused this?

The original issue was created for PyTorch 0.3, so let’s focus on your new problem instead. :wink:

Are you using any device-specific operations such as empty_cache without specifying the device id?
This could create a new CUDA context on the default device from another process, if you haven’t masked them via CUDA_VISIBLE_DEVICES.
If that’s not the case, could you post your code, so that we could have a look?

Ooh, there is a bunch of those, so it took me a few days to chase it out. It was indeed a device-specific op without specifying the ID (or rather, I parsed the args afterwards :man_facepalming:)!