cuda:0 gets full even when using four GPUs

I have four GPUs and I am using nn.DataParallel, passing all four GPU ids to it. However, only the first GPU gets full (32GB out of 32GB), while the others stay mostly empty (about 7GB out of 32GB). Is that because the first GPU acts as the master and aggregates the gradients, or is that not the usual behavior? I attached a snapshot of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:18:00.0 Off |                    0 |
| N/A   44C    P0    63W / 300W |  32134MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   35C    P0    56W / 300W |   6656MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   35C    P0    55W / 300W |   6524MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:AF:00.0 Off |                    0 |
| N/A   38C    P0    55W / 300W |   7052MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |

Yes, this might be the reason, as nn.DataParallel is known to create imbalanced memory usage. Try DistributedDataParallel instead, which won’t suffer from this issue.
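As a rough sketch of a single-node DDP setup (run_worker and the nn.Linear model are just placeholders for your own training code):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run_worker(rank, world_size):
    # One process per GPU; on a single node the rank doubles as the GPU index.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "3456")
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(10, 10).cuda(rank)   # replace with your model
    model = DDP(model, device_ids=[rank])

    # ... wrap your DataLoader with a DistributedSampler and train as usual ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()   # 4 in your case
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)

This way each process owns exactly one GPU, so the memory usage stays balanced across devices.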

I used DistributedDataParallel as follows:

model.cuda(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model)

Also, I initialized the distributed process group as follows:

rank = 0
dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:3456', world_size=1, rank=rank)

But still, only the first GPU is utilized:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:18:00.0 Off |                    0 |
| N/A   46C    P0    60W / 300W |  32284MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   32C    P0    42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   32C    P0    41W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

In that case you might want to check your script and see if you are explicitly pushing data to the default device (i.e. cuda:0) instead of the current rank.
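For example, inside the training loop every batch should be moved to the process’s own device rather than to cuda:0 (a sketch; train_one_epoch and its arguments are placeholders for your own loop):

import torch

def train_one_epoch(local_rank, ddp_model, train_loader, criterion, optimizer):
    # Every tensor goes to this process's GPU, not to the default cuda:0.
    device = torch.device(f"cuda:{local_rank}")

    for inputs, targets in train_loader:
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)

        optimizer.zero_grad()
        loss = criterion(ddp_model(inputs), targets)
        loss.backward()
        optimizer.step()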

Should I set the world_size parameter to three instead of one, since I have three GPUs?

The world_size indicates the number of participating processes, and in the simplest DDP use case it corresponds to the number of GPUs, so yes, you should change it if it’s currently set to 1.
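For instance, if you launch one process per GPU with torchrun (a sketch; the nn.Linear is a placeholder for your model), the rank and world_size are picked up from the environment automatically:

# launched with:  torchrun --standalone --nproc_per_node=3 train.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process,
# so init_method="env://" picks up the correct world_size by itself.
dist.init_process_group(backend="nccl", init_method="env://")

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(10, 10).cuda(local_rank)   # replace with your model
model = DDP(model, device_ids=[local_rank])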

Also, you might want to check this tutorial, which explains DDP usage in more detail.