Run PyTorch on Multiple GPUs


(Andre) #1

Hello

Just a newbie question on running PyTorch on multiple GPUs.
If I simply specify this:

device = torch.device("cuda:0")

this will only run on a single GPU, right?

If I have multiple GPUs and I want to utilize all of them, what should I do?
Will the command below automatically utilize all GPUs for me?

    use_cuda = not args.no_cuda and torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

#2

Wrapping your model in nn.DataParallel is an easy way to use your GPUs.
Have a look at the parallelism tutorial.
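For example, a minimal sketch (the model and tensor shapes here are just placeholders):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)  # placeholder model

    if torch.cuda.device_count() > 1:
        # DataParallel replicates the model onto all visible GPUs and
        # splits each input batch along dim 0 across them.
        model = nn.DataParallel(model)
    model = model.cuda()

    x = torch.randn(8, 10).cuda()  # this batch gets scattered across the GPUs
    out = model(x)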


(Andre) #3

Yes, I have browsed through that topic, but I didn't find information answering the multiple-GPU question.


#4

This tutorial might explain it better. Let me know if this helps you.


(KanZa) #5

Hi @ptrblck

I am trying to run this project: https://github.com/chaoyuaw/pytorch-coviar/blob/master/train.py with multiple GPUs. It stops after loading the videos. Any suggestions?


#6

Does this code run with a single GPU?
If so, could you set num_workers=0 for the DataLoaders in the multi-GPU setup and try it again?
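I.e. something like this (the dataset and batch size here are just placeholders):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.randn(100, 3, 224, 224),
                            torch.randint(0, 10, (100,)))  # placeholder dataset

    # num_workers=0 loads all data in the main process, which rules out
    # deadlocks in the worker processes as the cause of the hang.
    loader = DataLoader(dataset, batch_size=10,  # or args.batch_size as in train.py
                        shuffle=False, num_workers=0, pin_memory=True)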


(KanZa) #7

Yes, it works with a single GPU, but only when training with the resnet18 architecture and a smaller batch size, not with resnet152 and the original batch size described by the author.


#8

So with a single GPU the code also gets stuck for resnet152 and the original batch size?


(KanZa) #9

Yes, you are understanding me correctly.


(KanZa) #10

Do you mean the DataLoader the author uses in train.py for val_loader and train_loader, which looks like this:

batch_size=args.batch_size, shuffle=False, num_workers=args.workers, pin_memory=True)

So I should change both val_loader and train_loader like this:

batch_size=args.batch_size, shuffle=False,
num_workers=0, pin_memory=True)

Sorry, I am new to deep learning and PyTorch.


#11

Yes, I meant exactly this line of code. :wink:
Could you try that? Your error seems a bit strange, though, as resnet18 runs while resnet152 gets stuck.


(KanZa) #12

The author used a Tesla P100 GPU (FYI).

It still gives an error. I have also tried both of the following, together with the changes you described:

model = torch.nn.DataParallel(model, device_ids=None).cuda()

model = torch.nn.DataParallel(model, device_ids=args.gpus).cuda()


#13

Your GPU is out of memory. Probably the model is just too large for your GPU.
You could try to use torch.utils.checkpoint to trade compute for memory.
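A minimal sketch of how checkpointing could look (the layer sizes here are placeholders):

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    block1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
    block2 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())

    x = torch.randn(8, 1024, requires_grad=True)

    # checkpoint does not store the intermediate activations of the wrapped
    # block; it recomputes them during backward, trading compute for memory.
    h = checkpoint(block1, x)
    out = checkpoint(block2, h)
    out.sum().backward()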


(KanZa) #14

OK, thank you so much for your help.