nn.DataParallel doesn't automatically use all GPUs

I have a cluster of 10 GPUs. The default code,

    if torch.cuda.device_count() > 1:
        net = nn.DataParallel(net)

uses only 8 GPUs. Does PyTorch have a maximum 8-GPU policy? The number of training workers is 1 at the moment, but this behavior doesn't change when I try 1, 2, or 10.

What does torch.cuda.device_count() return?

Have you tried:

    dev_count = torch.cuda.device_count()
    if dev_count > 1:
        net = nn.DataParallel(net, device_ids=list(range(dev_count)))

It also looks like setups with more than 8 GPUs are not always handled very well, so the issue might come from torch.cuda.device_count(). If you know the number of GPUs, you could also hard-code it:

    dev_count = 10
    net = nn.DataParallel(net, device_ids=list(range(dev_count)))

In addition to what has been said: how large is your batch size, and is it divisible by the number of GPUs?


Thanks for your reply. The device count is 10. I also tried what you suggested and that didn't work. @ptrblck was right, silly me: I was using a batch size of 16. I increased it to 20 and now it uses all the GPUs. Thanks guys :)
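
For anyone wondering why a batch of 16 only kept 8 of the 10 GPUs busy: nn.DataParallel scatters the input along dimension 0 into roughly equal chunks, one per device, so a batch that doesn't split into enough chunks simply leaves some GPUs idle. A minimal, CPU-only sketch of that splitting behavior (illustrative only, using torch.chunk, which is effectively what the scatter step does):

    import torch

    # A batch of 16 samples split across 10 devices yields 8 chunks of 2,
    # so 2 of the GPUs receive no data at all.
    chunks = torch.arange(16).chunk(10)
    print(len(chunks))                  # 8 -> only 8 GPUs get work
    print([len(c) for c in chunks])     # [2, 2, 2, 2, 2, 2, 2, 2]

    # A batch of 20 splits evenly into 10 chunks of 2, so all GPUs are used.
    print(len(torch.arange(20).chunk(10)))  # 10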

I don’t want to make a new entry for this so just replying to this thread.
MultiGPU keeps crashing with an OSError.

    File "ExploreArch.py", line 245, in <module>
      for bt, data in enumerate(trainloader):
    File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 819, in __next__
      return self._process_data(data)
    File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 846, in _process_data
      data.reraise()
    File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 369, in reraise
      raise self.exc_type(msg)
    OSError: Caught OSError in DataLoader worker process 1.
    Original Traceback (most recent call last):
    File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
      data = fetcher.fetch(index)
    File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
      data = [self.dataset[idx] for idx in possibly_batched_index]
    File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
      data = [self.dataset[idx] for idx in possibly_batched_index]
    File "ExploreArch.py", line 102, in __getitem__
    OSError: [Errno 5] Input/output error:

This system has 10 P4 cards. I’ve made the following observations:
With num_workers=0, there are no crashes. However, it slows down processing.
With num_workers=20, I get the exact same error but at a much earlier epoch (say, 2).
With num_workers=10, the code ran smoothly until epoch 34 or so and then crashed.

I've seen a few related threads, but the solution isn't readily obvious. Any suggestions? If not, I'll move forward with num_workers = 0.

The error points to the file read operation, which is a bit weird.
I assume you are reading some files in the __getitem__ method and loading the data there.
Which kind of files are you using, and do you use a custom collate_fn or sampler?

Hi @ptrblck, sorry for the late response. It's a simple imread of PNG files in __getitem__. I use the default collate function that PyTorch provides. An alternative explanation could be that the file I/O speed is low compared to the processing speed, but I'm guessing PyTorch would have locking mechanisms to handle such situations.
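
In case it helps others hitting the same Errno 5 on a slow or networked mount: a common workaround is to retry the read inside __getitem__ instead of letting a single transient failure kill the worker process. This is only a rough sketch under that assumption; the cv2.imread call, the self.paths list, and the retry parameters are illustrative and not from the original ExploreArch.py:

    import time
    import cv2
    from torch.utils.data import Dataset

    class RetryingPNGDataset(Dataset):
        """Sketch: retry flaky file reads instead of letting one transient
        OSError crash the DataLoader worker."""

        def __init__(self, paths, retries=3, delay=0.5):
            self.paths = paths       # hypothetical list of PNG file paths
            self.retries = retries
            self.delay = delay

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, idx):
            last_err = None
            for _ in range(self.retries):
                try:
                    img = cv2.imread(self.paths[idx], cv2.IMREAD_UNCHANGED)
                    if img is not None:
                        return img
                except OSError as err:    # transient I/O error, e.g. on NFS
                    last_err = err
                time.sleep(self.delay)    # brief back-off before retrying
            raise OSError("repeated I/O failure reading " + self.paths[idx]) from last_err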

It also seems the master GPU (the default device:0) has higher memory requirements because it is used to perform backprop(?). Would you advise a system with, say, nine 8 GB GPUs and a tenth with 16 GB? That way we could maximize usage across the rest of the GPUs. Any advice would be greatly appreciated!

@Rakshit_Kothari, did you find a solution to this? I'm facing the same issue.

Sorry for the late reply. This is a known issue, and the Google query you're looking for is "GPU memory imbalance". The correct answer is to choose a batch size that is a multiple of the number of GPUs. For instance, with 10 GPUs the batch size should be 10, 20, 30, and so on.

Also, you'll need to increase the number of workers to match the capability of your system. Note that there is very little runtime benefit to having 10 GPUs (as my network admin learned after a gazillion hang-ups); data I/O will usually be the bottleneck. The only advantage of a 10 GPU system would be the insanely large batch size, i.e., higher learning rates.
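
On the device:0 memory question above: rather than relying on one larger GPU, a commonly suggested workaround is to compute the loss inside the wrapped module, so each replica returns only a small scalar and less output data is gathered on the default device. A hedged sketch, assuming a hypothetical net and a standard criterion (not code from this thread):

    import torch.nn as nn

    class ModelWithLoss(nn.Module):
        """Sketch: wrap model + criterion so the loss is computed on each
        replica and only per-GPU scalars are gathered on device 0."""

        def __init__(self, model, criterion):
            super().__init__()
            self.model = model
            self.criterion = criterion

        def forward(self, inputs, targets):
            outputs = self.model(inputs)
            return self.criterion(outputs, targets)

    # Hypothetical usage:
    # wrapped = nn.DataParallel(ModelWithLoss(net, nn.CrossEntropyLoss()).cuda())
    # loss = wrapped(images, labels).mean()   # average the per-GPU losses
    # loss.backward()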
