This is a separate issue, potentially related to: Sending a tensor to multiple GPUs
I am training a DataParallel module on two GPUs. The training works as it should when:
a) Training on a single GPU, where the model is not wrapped in DataParallel, regardless of batch size.
b) Training with both GPUs available, but with batch size = 1, so the data is sent to only one GPU.
However, when I increase the batch size to >1 so that both GPUs are used, the program gets stuck on the backward pass, loss.backward().
Is this a hardware issue (as suggested in a comment here: Sending a tensor to multiple GPUs), or something to do with autograd (all variables have gradients, so I don't think that is the issue)?
A sketch of the setup and the rest of the training code (somewhat streamlined) can be found below.
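For context, the model and optimizer are set up roughly like this. This is only a minimal sketch: the ToyNet class, the Adam optimizer, the learning rate, and the device IDs are placeholders standing in for my actual code.

import torch
import torch.nn as nn

# Minimal stand-in for the real network; the subject/graph attributes only
# mirror the ones the training loop touches below.
class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 4)
        self.subject = None
        self.graph = None

    def forward(self, x):
        return torch.log_softmax(self.fc(x), dim=1)

device = torch.device("cuda:0")
model = nn.DataParallel(ToyNet(), device_ids=[0, 1]).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)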
for batch_idx, (x, y, graph, subject) in enumerate(self.train_loader):
    # Update subject-specific state on the underlying module
    # (model is wrapped in DataParallel, hence model.module).
    if model.module.subject != subject:
        model.module.subject = subject
        model.module.graph = graph

    output = model(x.to(self.device))   # forward pass, split across the GPUs
    target = torch.argmax(y, dim=1)     # class indices from one-hot labels

    optimizer.zero_grad()
    loss = F.nll_loss(output, target, weight=self.w)   # F is torch.nn.functional
    loss.backward()                     # <-- hangs here with batch size > 1 on two GPUs
    optimizer.step()
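For what it's worth, this is roughly how I checked the gradients in a run that completes (single GPU); the check itself is generic and the variable names are placeholders.

# After loss.backward() finishes, list any parameters that require grad
# but did not receive a gradient.
missing = [name for name, p in model.named_parameters()
           if p.requires_grad and p.grad is None]
print("parameters without gradients:", missing)   # empty in my case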