Run PyTorch on Multiple GPUs

andrew_su · July 9, 2018, 8:36pm

Hello

Just a newbie question about running PyTorch on multiple GPUs.
If I simply specify this:

device = torch.device("cuda:0")

this only runs on a single GPU, right?

If I have multiple GPUs and I want to utilize ALL OF THEM, what should I do?
Will the code below automatically utilize all GPUs for me?

    use_cuda = not args.no_cuda and torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

ptrblck · July 9, 2018, 8:38pm

Wrapping your model in nn.DataParallel is an easy way to use your GPUs.
Have a look at the parallelism tutorial.
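For readers skimming the thread, here is a minimal sketch of what wrapping a model looks like. The tiny nn.Linear model and the tensor sizes are made up for illustration and are not from the original post. On a machine without GPUs, nn.DataParallel simply calls the wrapped module, so the snippet should also run on CPU.

```python
import torch
import torch.nn as nn

# Hypothetical toy model, only for illustration.
model = nn.Linear(10, 2)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# With device_ids left at its default, DataParallel uses all visible GPUs.
if torch.cuda.device_count() > 1:
    print("Using", torch.cuda.device_count(), "GPUs")
    model = nn.DataParallel(model)
model = model.to(device)

# The batch (dim 0) is split across the GPUs inside the forward call,
# and the per-GPU outputs are gathered back into one tensor.
inputs = torch.randn(8, 10, device=device)
outputs = model(inputs)
print(outputs.shape)  # torch.Size([8, 2])
```

Note that torch.device("cuda") on its own only selects the current GPU; it is the nn.DataParallel wrapper that spreads a batch over several devices.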

andrew_su · July 9, 2018, 8:50pm

Yes, I have browsed through the topic, but I didn’t find info answering the multiple-GPU question.

ptrblck · July 9, 2018, 9:06pm

This tutorial might explain it better. Let me know if this helps you.

KanZa · September 11, 2018, 1:52am

Hi @ptrblck

I am trying to run this project with multiple GPUs: train.py from chaoyuaw/pytorch-coviar (master branch, GitHub). It stops after uploading the videos. Any suggestions?

ptrblck · September 11, 2018, 8:05am

Does this code run with a single GPU?
If so, could you set num_workers=0 for the DataLoaders in the multi-GPU setup and try again?
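A minimal sketch of that debugging step, assuming a stand-in TensorDataset rather than the project's real video dataset (the dataset, sizes, and variable names here are hypothetical):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in dataset: 16 samples with 3 features each.
dataset = TensorDataset(torch.randn(16, 3), torch.zeros(16, dtype=torch.long))

# num_workers=0 loads batches in the main process (no worker subprocesses),
# which helps rule out worker-related hangs while debugging.
loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=False,
    num_workers=0,
    pin_memory=torch.cuda.is_available(),  # pinning only matters with a GPU
)

batch_shapes = [tuple(x.shape) for x, y in loader]
print(len(batch_shapes), batch_shapes[0])
```

If the hang disappears with num_workers=0, the problem is more likely in the data loading workers than in the multi-GPU code itself.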

KanZa · September 11, 2018, 8:19am

Yes, it works on a single GPU, but only when training with the resnet18 architecture and a smaller batch size, not with resnet152 and the original batch size described by the author.

ptrblck · September 11, 2018, 8:22am

So using a single GPU, the code also gets stuck for resnet152 and the original batch size?

KanZa · September 11, 2018, 8:22am

Yes, you understand me correctly.

KanZa · September 11, 2018, 8:25am

Do you mean that in train.py the author creates the DataLoaders for val_loader and train_loader like this:

batch_size=args.batch_size, shuffle=False, num_workers=args.workers, pin_memory=True)

and that I should change both val_loader and train_loader like this?

batch_size=args.batch_size, shuffle=False,
num_workers=0, pin_memory=True)

Sorry, I am new to deep learning and PyTorch.

ptrblck · September 11, 2018, 8:28am

Yes, I meant exactly this line of code.
Could you try that? Your error seems a bit strange, though, as resnet18 runs while resnet152 gets stuck.

KanZa · September 11, 2018, 8:50am

The author used a Tesla P100 GPU (FYI).

It gave an error. I have also tried both of the following, together with the changes you described:

model = torch.nn.DataParallel(model, device_ids=None).cuda()

model = torch.nn.DataParallel(model, device_ids=args.gpus).cuda()

ptrblck · September 11, 2018, 9:06am

Your GPU is out of memory. Probably the model is just too large for your GPU.
You could try to use torch.utils.checkpoint to trade compute for memory.
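A rough sketch of how torch.utils.checkpoint can be applied, assuming a recent PyTorch version that accepts the use_reentrant argument. The small nn.Sequential block is a hypothetical stand-in for an expensive part of a large model such as resnet152:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Hypothetical block standing in for an expensive part of a large model.
block = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))

x = torch.randn(4, 32, requires_grad=True)

# Intermediate activations inside `block` are not stored; they are
# recomputed during the backward pass, saving memory at the cost of
# extra compute. use_reentrant=False assumes a recent PyTorch release.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)
```

Reducing the batch size is the other common way to get under the memory limit, which matches the observation above that the smaller configuration runs.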

KanZa · September 11, 2018, 9:19am

OK Thank you so much for your help.

DoubtWang · December 23, 2018, 1:34pm

Hi,
I had a look at data_parallel_tutorial and parallelism_tutorial.
I have written my code, but it fails to use multiple GPUs.

Here opt.gpus = [0, 1].

class Model(nn.Module):
  def sample(self, input):
    input = input.cuda()
    output = self.other_function(input)
    return output

if len(opt.gpus) > 1:
    print("Let's use", len(opt.gpus), "GPUs!")
    os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(str(x) for x in opt.gpus)
else:
    torch.cuda.set_device(opt.gpus[0])
torch.cuda.manual_seed(opt.seed)

model = Model()
if len(opt.gpus) > 1:
    model = nn.DataParallel(model, device_ids=opt.gpus, dim=0)
if use_cuda:
    model = model.cuda()
 
model.train()
out = model.module.sample(data)

data.shape : (Batch_size, time_step)

However, the model only uses GPU 0 when I run my code.
I don’t know why.
Could you explain it?
Thanks.

ptrblck · December 23, 2018, 1:43pm

How large is your batch size?
Using nn.DataParallel, your data will be split in dim 0 and each chunk of data will be sent to a device.
If the batch size is too small, some GPUs won’t get a data chunk.
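To make the splitting concrete, here is a small sketch. It assumes that the scatter step of nn.DataParallel divides the batch the same way torch.chunk does, which is my understanding of the internals rather than something stated in this thread. The helper name chunk_sizes and the feature size 5 are made up for illustration.

```python
import torch

def chunk_sizes(batch_size, num_gpus):
    """Sizes of the pieces torch.chunk produces along dim 0."""
    batch = torch.zeros(batch_size, 5)  # 5 is an arbitrary feature size
    return [c.shape[0] for c in batch.chunk(num_gpus, dim=0)]

print(chunk_sizes(8, 2))  # [4, 4]
print(chunk_sizes(1, 2))  # [1]: only one chunk, so a second GPU gets nothing
print(chunk_sizes(5, 4))  # [2, 2, 1]: fewer chunks than GPUs
```

This runs on CPU, so it only shows the chunk sizes, not the actual device placement.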

DoubtWang · December 23, 2018, 3:41pm

Thank you for the quick reply (appreciated).
My batch_size is 8,
because each train_data sample is very long.
I get an out-of-memory error when I use only one GPU,
so I plan to use multiple GPUs to train my model.
However, the model only uses GPU 0 when gpus = [0, 1].

In general, what should batch_size be equal to?
Could you give me some advice? Thank you.

ptrblck · December 23, 2018, 4:21pm

If your batch size is 8, each GPU should get 4 samples.
Do you see GPU1 completely empty, without any utilization, in nvidia-smi or a similar tool?

DoubtWang · December 24, 2018, 12:49am

Sorry for the late reply; it is because of the time zone.

I ran watch -n 1 -d nvidia-smi while my model was running, and GPU1 is completely empty.
I am confused about why this happens.
Can you explain it from your experience?
Thanks.

Besides,
I printed the shape of the data tensor, (time_step, Batch_size_data, embedding_dim),
before out = model.module.sample(data).
I also printed the shape of the output tensor, (time_step, Batch_size_output, hidden_dim), after other_function.

class Model(nn.Module):
  def sample(self, input):
    input = input.cuda()
    output = self.other_function(input)
    return output

However, Batch_size_data == Batch_size_output.
In theory, I think Batch_size_data should be equal to 2*Batch_size_output when using GPU0 and GPU1.

Note that I changed the parameter setting to dim=1, since the batch is in dimension 1 of the tensor:
model = nn.DataParallel(model, device_ids=opt.gpus, dim=1)

DoubtWang · December 25, 2018, 12:56am

I found that just renaming the function from sample to forward, and calling the model directly, solves my problem.

I changed this part to:

class Model(nn.Module):
  def forward(self, input):
    input = input.cuda()
    output = self.other_function(input)
    return output

model.train()
out = model(data)
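A likely explanation for why this works: nn.DataParallel only splits the batch when the wrapper itself is called, which dispatches to forward on each replica. model.module is the original, unwrapped model, so model.module.sample(data) runs the whole batch on one device. The sketch below illustrates the difference. The nn.Linear layer and tensor sizes are hypothetical, and on a machine with fewer than two GPUs both calls behave the same.

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 4)  # hypothetical layer for illustration

    def forward(self, input):
        # Under DataParallel, each replica sees only its chunk of the batch.
        print("forward got", tuple(input.shape), "on", input.device)
        return self.linear(input)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.DataParallel(Model()).to(device)
data = torch.randn(8, 16, device=device)

# Goes through DataParallel: scatter the batch, run replicas, gather outputs.
out = model(data)

# Bypasses DataParallel: the full batch runs on a single device.
out_direct = model.module(data)

print(out.shape, out_direct.shape)
```

With two GPUs, the first call should print two lines with shape (4, 16), one per device, while the second call prints a single line with shape (8, 16).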