DataParallel(model) Multi- vs Single-GPU

I have a model and criterion:

model = torch.nn.DataParallel(model).cuda()
criterion = nn.TripletMarginLoss(margin=0.2, p=2).cpu()

and I run the forward pass:

input_var = torch.autograd.Variable(input)
output = model(input_var)

and other Autograd Functions:

normalized_output = F.normalize(output, p=2, dim=1).cpu()
anchor, positive, negative = TripletSample.apply(normalized_output, target_var)
loss = criterion(anchor, positive, negative)

In the single-GPU case (where DataParallel is not used), the network trains as expected. In the multi-GPU case (all else being equal), the network does not learn anything.

UPDATE: It doesn’t look like the activations/gradients are the same when run on one GPU vs. all GPUs. The difference is on the order of 1e-2, which seems like a lot… Is this caused by moving from GPU to CPU and back?

I expect the model to run as follows:

  1. replicate model, split data (gpu)
  2. forward model in parallel (gpu)
  3. gather outputs (gpu)
  4. move to cpu (cpu)
  5. normalize on outputs (cpu)
  6. tripletsample on outputs (cpu)
  7. loss on outputs (cpu)
  8. backward until back at DataParallel (cpu)
  9. scatter (gpu)
  10. backward on rest of model (gpu)
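The flow above can be sketched end to end in a minimal, CPU-safe way. This is an assumption-laden sketch: the real model and `TripletSample.apply()` aren’t shown in the thread, so a toy `nn.Linear` stands in for the model and a plain `chunk()` stands in for the triplet sampling, with the batch assumed to be laid out as `[anchors | positives | negatives]`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the real network; wrap in DataParallel only when
# GPUs are actually visible, so the sketch also runs on CPU.
model = nn.Linear(16, 8)
if torch.cuda.is_available():
    model = nn.DataParallel(model).cuda()

criterion = nn.TripletMarginLoss(margin=0.2, p=2)

# Hypothetical batch layout: [anchors | positives | negatives].
batch = torch.randn(12, 16)
if torch.cuda.is_available():
    batch = batch.cuda()

output = model(batch)                         # steps 1-3: replicate, forward, gather
output = output.cpu()                         # step 4: move to CPU
normalized = F.normalize(output, p=2, dim=1)  # step 5: normalize on CPU
# step 6: stand-in for TripletSample.apply(); splits the batch thirds
anchor, positive, negative = normalized.chunk(3, dim=0)
loss = criterion(anchor, positive, negative)  # step 7: loss on CPU
loss.backward()                               # steps 8-10: backward through CPU ops,
                                              # then scatter back through the model
```

Autograd handles the CPU/GPU boundary transparently here: `.cpu()` is itself a differentiable op, so the backward pass moves gradients back to the GPU(s) without any manual work.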


I’ve just tested the forward and backward passes: the activations and gradients are roughly the same, but they differ on the order of 1e-2 to 1e-3 between the single-GPU and multi-GPU cases… Could this be the problem?

DataParallel chunks your data across the batch dimension so that it can send a separate chunk to each GPU, which can explain the difference in gradients. Without seeing more of your model it is hard to say why this would cause it to stop training.
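The batch-chunking effect is easy to see numerically. With BatchNorm in train mode, each replica normalizes its chunk with that chunk’s own mean and variance rather than the full-batch statistics, and for typical chunk sizes the deviation is on the order of the differences reported above. A pure-Python sketch (no PyTorch needed, with a hypothetical 64-sample batch split across 4 devices):

```python
import random

random.seed(0)

# One feature channel of a 64-sample batch.
batch = [random.gauss(0.0, 1.0) for _ in range(64)]

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Single-GPU BatchNorm: one mean/var over the whole batch.
full_mean, full_var = mean(batch), var(batch)

# DataParallel across 4 GPUs: each replica sees a 16-sample chunk
# and normalizes with that chunk's own statistics.
chunks = [batch[i:i + 16] for i in range(0, 64, 16)]
chunk_means = [mean(c) for c in chunks]
chunk_vars = [var(c) for c in chunks]

# How far each replica's mean drifts from the full-batch mean; this
# drift feeds directly into the normalized activations and gradients.
max_mean_dev = max(abs(m - full_mean) for m in chunk_means)
print(max_mean_dev)
```

Note that the averaged per-chunk means still equal the full-batch mean, so nothing looks wrong in aggregate; it is the per-replica normalization that diverges.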

It is a ResNet model.

Each GPU has a different mini-batch mean and variance for BatchNorm, but I fixed them by setting BatchNorm to eval() mode, just for the purpose of figuring out what’s going wrong.

Even then, the activations and gradients were similar but not identical between single- and multi-GPU, differing on the order of 1e-2.

Hi @aizawak ,
Can you please share your code?