DataParallel(model) Multi- vs Single-GPU

I have a model and criterion:

model = torch.nn.DataParallel(model).cuda()
criterion = nn.TripletMarginLoss(margin=0.2, p=2).cpu()

and I run the forward pass:

input_var = torch.autograd.Variable(input)
output = model(input_var)

and other Autograd Functions:

normalized_output = F.normalize(output, p=2, dim=1).cpu()
anchor, positive, negative = TripletSample.apply(normalized_output, target_var)
loss = criterion(anchor, positive, negative)

In the single-GPU case (where DataParallel is not used), the network trains as expected. In the multi-GPU case (all else being equal), the network does not learn anything.

UPDATE: It doesn’t look like the activations/gradients are the same when run on one GPU vs. all GPUs. The difference is on the order of 1e-2, which seems like a lot… Is this caused by moving from GPU to CPU and back?

I expect the model to run as follows:

  1. replicate model, split data (gpu)
  2. forward model in parallel (gpu)
  3. gather outputs (gpu)
  4. move to cpu (cpu)
  5. normalize on outputs (cpu)
  6. tripletsample on outputs (cpu)
  7. loss on outputs (cpu)
  8. backward until back at DataParallel (cpu)
  9. scatter (gpu)
  10. backward on rest of model (gpu)
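The flow above can be sketched end to end in a minimal, CPU-safe way. This is an assumption-laden sketch: the real model and `TripletSample.apply()` aren’t shown in the thread, so a toy `nn.Linear` stands in for the model and a plain `chunk()` stands in for the triplet sampling, with the batch assumed to be laid out as `[anchors | positives | negatives]`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the real network; wrap in DataParallel only when
# GPUs are actually visible, so the sketch also runs on CPU.
model = nn.Linear(16, 8)
if torch.cuda.is_available():
    model = nn.DataParallel(model).cuda()

criterion = nn.TripletMarginLoss(margin=0.2, p=2)

# Hypothetical batch layout: [anchors | positives | negatives].
batch = torch.randn(12, 16)
if torch.cuda.is_available():
    batch = batch.cuda()

output = model(batch)                         # steps 1-3: replicate, forward, gather
output = output.cpu()                         # step 4: move to CPU
normalized = F.normalize(output, p=2, dim=1)  # step 5: normalize on CPU
# step 6: stand-in for TripletSample.apply(); splits the batch thirds
anchor, positive, negative = normalized.chunk(3, dim=0)
loss = criterion(anchor, positive, negative)  # step 7: loss on CPU
loss.backward()                               # steps 8-10: backward through CPU ops,
                                              # then scatter back through the model
```

Autograd handles the CPU/GPU boundary transparently here: `.cpu()` is itself a differentiable op, so the backward pass moves gradients back to the GPU(s) without any manual work.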


I’ve just tested the forward and backward passes: the activations and gradients are roughly the same, but they differ on the order of 1e-2 to 1e-3 between the single-GPU and multi-GPU cases… Could this be the problem?

DataParallel chunks your data across the batch dimension so that it can send a separate chunk to each GPU, which can explain the difference in gradients. Without seeing more of your model it is hard to say why this would cause it to stop training.
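The batch-chunking effect is easy to see numerically. With BatchNorm in train mode, each replica normalizes its chunk with that chunk’s own mean and variance rather than the full-batch statistics, and for typical chunk sizes the deviation is on the order of the differences reported above. A pure-Python sketch (no PyTorch needed, with a hypothetical 64-sample batch split across 4 devices):

```python
import random

random.seed(0)

# One feature channel of a 64-sample batch.
batch = [random.gauss(0.0, 1.0) for _ in range(64)]

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Single-GPU BatchNorm: one mean/var over the whole batch.
full_mean, full_var = mean(batch), var(batch)

# DataParallel across 4 GPUs: each replica sees a 16-sample chunk
# and normalizes with that chunk's own statistics.
chunks = [batch[i:i + 16] for i in range(0, 64, 16)]
chunk_means = [mean(c) for c in chunks]
chunk_vars = [var(c) for c in chunks]

# How far each replica's mean drifts from the full-batch mean; this
# drift feeds directly into the normalized activations and gradients.
max_mean_dev = max(abs(m - full_mean) for m in chunk_means)
print(max_mean_dev)
```

Note that the averaged per-chunk means still equal the full-batch mean, so nothing looks wrong in aggregate; it is the per-replica normalization that diverges.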

It is a ResNet model.

Each GPU has a different mini-batch mean and variance for BatchNorm, but I fixed them by setting BatchNorm to eval() mode, just for the purpose of figuring out what’s going wrong.

Even then, the activations and gradients were similar but not identical between single- and multi-GPU, differing on the order of 1e-2.

Hi @aizawak ,
Can you please share your code?