All GPUs used when relocating tensor to CUDA

Hi, all

I noticed an issue with my current setup. When I simply execute the following code snippet, the process shows up on all GPUs. Specifically, cuda:0 is where the data is relocated to; the other GPUs report zero memory usage, but the process is still listed on them for some reason.

import torch
a = torch.rand(10)  # tensor created on the CPU
b = a.cuda()        # moved to the default CUDA device (cuda:0)

Below are the GPU usage stats reported by nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  On   | 00000000:04:00.0 Off |                  N/A |
| 22%   31C    P2    68W / 250W |    519MiB / 12212MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  On   | 00000000:05:00.0 Off |                  N/A |
| 22%   27C    P8    15W / 250W |      4MiB / 12212MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  On   | 00000000:08:00.0 Off |                  N/A |
| 22%   28C    P8    15W / 250W |      4MiB / 12212MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  On   | 00000000:09:00.0 Off |                  N/A |
| 22%   26C    P8    15W / 250W |      4MiB / 12212MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX TIT...  On   | 00000000:85:00.0 Off |                  N/A |
| 22%   27C    P8    14W / 250W |      4MiB / 12212MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX TIT...  On   | 00000000:86:00.0 Off |                  N/A |
| 22%   27C    P8    15W / 250W |      4MiB / 12212MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX TIT...  On   | 00000000:89:00.0 Off |                  N/A |
| 22%   24C    P8    15W / 250W |      4MiB / 12212MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX TIT...  On   | 00000000:8A:00.0 Off |                  N/A |
| 22%   27C    P8    15W / 250W |      4MiB / 12212MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     32155      C   python                            514MiB |
|    1   N/A  N/A     32155      C   python                              0MiB |
|    2   N/A  N/A     32155      C   python                              0MiB |
|    3   N/A  N/A     32155      C   python                              0MiB |
|    4   N/A  N/A     32155      C   python                              0MiB |
|    5   N/A  N/A     32155      C   python                              0MiB |
|    6   N/A  N/A     32155      C   python                              0MiB |
|    7   N/A  N/A     32155      C   python                              0MiB |
+-----------------------------------------------------------------------------+

Other information:
OS: Ubuntu 20.04.1 LTS
PyTorch: 1.7.1 (other versions seem to have the issue too)
CUDA version: 11.1
Driver Version: 455.32.00
Hardware: 8x GeForce GTX TITAN X

Could someone please let me know what is going on?

Many thanks,
Fred

CUDA initialization sees all devices and might thus allocate the small amount of memory on each of them. You can hide the other devices by running CUDA_VISIBLE_DEVICES=0 python script.py args to make only certain GPUs visible inside your script.
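In case it helps, the same restriction can also be applied from inside the script by setting the environment variable before the first CUDA call; a minimal sketch (safest is to set it before importing torch):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # must be set before CUDA is initialized

import torch
a = torch.rand(10)
b = a.cuda()                               # only GPU 0 is visible to this process
print(torch.cuda.device_count())           # prints 1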

Hi @ptrblck

Thanks for the reply. I’m actually using DistributedDataParallel, and because of this issue each spawned process occupies a small amount of memory on every GPU, which causes the whole run to hang.

If you are using the recommended mode of one GPU per process, you could use the launch scripts to make sure each process only sees a single GPU.
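For reference, a minimal one-GPU-per-process sketch under torch.distributed.launch (script name and arguments are placeholders; the launcher supplies --local_rank and the rendezvous environment variables):

# launched with something like:
#   python -m torch.distributed.launch --nproc_per_node=8 train.py
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)   # filled in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)                     # bind this process to one GPU
dist.init_process_group(backend="nccl", init_method="env://")
device = torch.device("cuda", args.local_rank)
x = torch.rand(10, device=device)                          # stays on this process's GPU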

Yeah, I’m using one GPU per process, but the code is written so that the processes are spawned in the __main__ function (roughly the layout sketched below). Although it wouldn’t be hard to modify it to use the launch script, I’d really like to understand why all CUDA devices are used by a single process. It used to be fine: only cuda:0 was occupied when the device id was not specified. But after a system update I re-installed CUDA and PyTorch (same versions), and now it behaves like this. It seems very peculiar to me.
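For context, the spawn layout looks roughly like this (a simplified sketch; the per-worker CUDA_VISIBLE_DEVICES line is the workaround I’d have to add to hide the other devices before the first CUDA call):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # workaround: hide the other devices before this process touches CUDA
    os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    x = torch.rand(10).cuda()        # with one visible device, this lands on that GPU only
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 8                   # 8x TITAN X in this machine
    mp.spawn(worker, args=(world_size,), nprocs=world_size)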

I found the problem. It’s most likely a display bug in driver 455.32.00. I upgraded to CUDA 11.2 with driver version 460.27.04 and everything is fine now.


I’m facing the same problem.

Did you mean the NVIDIA graphics driver?

Yes. Simply upgrading the CUDA toolkit to 11.2 will automatically upgrade the driver too.