Data parallel tutorial

I’m struggling with a multi-GPU setup using DataParallel: I’m only getting a 1.7x speedup from 3 GPUs. Any blind suggestions on how that could be improved would be appreciated. One thing I’ve noticed is that PyTorch uses roughly 10x more PCIe bandwidth than a parallelized Keras+TF setup with a similar model, and I have very little idea why.

Anyway, I’m looking at the http://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html tutorial and I see model.gpu() and similar .gpu() calls there. Is that a mistake by someone who meant to write .cuda()?

Seems like it. Without code it is hard to say why you don’t get more performance!

I’m guessing there may be some general suggestions and reasons. I’m seeing very high PCIe traffic even when I train on a single GPU - I think it’s 2x-4x that of Keras+TF. I have no clue why so much data is being sent, when basically all that seems necessary to send is the input data and the instructions for what to compute. I don’t think it has much to do with anything I do specifically; it seems more like general PyTorch DataParallel behavior?

It seems to me that some totally unnecessary data is being exchanged between the CPU and the GPUs, and that’s just how PyTorch currently is.

Hi there,

I am just starting with PyTorch, but have you tried setting the flag torch.backends.cudnn.benchmark = True? It should help (if you have a static architecture, i.e. fixed input sizes).
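
For reference, a minimal sketch of where that flag would go (the tiny conv model and shapes below are just placeholders, not anything from this thread):

    import torch
    import torch.nn as nn

    # Ask cuDNN to benchmark the available convolution algorithms once and
    # cache the fastest one; this only pays off when input shapes stay constant.
    torch.backends.cudnn.benchmark = True

    model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).cuda()
    x = torch.randn(8, 3, 224, 224).cuda()

    # The first pass is slower while cuDNN tries algorithms; later passes
    # with the same shapes reuse the cached choice.
    for _ in range(5):
        y = model(x)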


Thanks, that might help in the future. It didn’t change the speed this time, though, and it won’t influence PCIe traffic. As far as I understand, it just auto-selects the fastest algorithm for the given convolution parameters.


One thing I very much wonder about is this example from the docs:

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)

if torch.cuda.is_available():
    model.cuda()

for data in rand_loader:
    if torch.cuda.is_available():
        input_var = Variable(data.cuda())
    else:
        input_var = Variable(data)

    output = model(input_var)
    print("Outside: input size", input_var.size(),
          "output_size", output.size())

When and where is the data actually transferred? Are we first transferring it to, say, GPU0 with data.cuda(), and then back to the CPU and from the CPU to the other GPUs when DataParallel takes that input?

That’s quite an important question. And similarly: what happens with the loss, the backward pass, and the loss weights?
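
One way to poke at this empirically, just as a sketch (DeviceEcho, the sizes and the loss below are made-up placeholders, and it assumes a multi-GPU box): printing the device inside forward shows where each replica’s chunk of the batch ends up, and the comments mark where the gather, the loss and the gradient reduction happen in the usual DataParallel pattern.

    import torch
    import torch.nn as nn

    class DeviceEcho(nn.Module):
        """Toy module that reports which GPU its chunk of the batch lands on."""
        def __init__(self):
            super(DeviceEcho, self).__init__()
            self.fc = nn.Linear(5, 2)

        def forward(self, x):
            # Each replica sees only its slice of the batch, already on its own GPU.
            print("Inside: device", x.device, "chunk size", x.size(0))
            return self.fc(x)

    model = nn.DataParallel(DeviceEcho()).cuda()   # parameters live on GPU0
    criterion = nn.MSELoss()

    data = torch.randn(30, 5)
    target = torch.randn(30, 2)

    inp = data.cuda()               # one host-to-device copy; the full batch lands on GPU0
    out = model(inp)                # DataParallel scatters the GPU0 batch across the visible GPUs
    print("Outside: device", out.device)   # per-replica outputs are gathered back onto GPU0

    loss = criterion(out, target.cuda())   # so the loss is computed on GPU0
    loss.backward()                 # replica gradients are reduced onto the GPU0 parameters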

While Keras+TF uses maybe 2% of the PCIe bandwidth when training on a single GPU, PyTorch will easily chew up 20x that amount.

I’m not sure which of the things I’ve tweaked did it - my guess is the removal of unpooling with pool indices, which I’ll test later - but PCIe usage has by now dropped for me to the same Keras-level 1-3%. And, of course, scaling efficiency went up.

The PCIe usage problem is back for me for some reason. Has anybody tackled this issue? I’m running on a single GPU here with no pool indices, and the PCIe 16x link is loaded to around 45%. No clue why. If there’s any way to debug this and find out what’s actually being transferred, that would be cool.
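
One possible starting point, just as a sketch (the nn.Linear model and input here are placeholders for the real training step): the autograd profiler at least shows which operators and copies dominate one iteration on the GPU.

    import torch
    import torch.nn as nn
    from torch.autograd import profiler

    # Placeholder model/input; substitute the real training step here.
    model = nn.Linear(128, 64).cuda()
    inp = torch.randn(32, 128).cuda()

    # Profile one step on the GPU and look for copy-related entries
    # (e.g. `to` / `copy_`) or anything unexpectedly expensive.
    with profiler.profile(use_cuda=True) as prof:
        out = model(inp)
        loss = out.sum()
        loss.backward()

    print(prof.key_averages().table(sort_by="cuda_time_total"))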

How are you measuring PCIe bandwidth?