Batch size does NOT match after DataParallel forward

(James Jing Tang) #1

Hi, I trained a model on multiple GPUs using DataParallel, but on the last batch it raised ValueError: Expected input batch_size to match target batch_size. Does anyone know why?

model = torch.nn.DataParallel(model, device_ids=[0, 1, 2]).cuda()  # parameters must live on device_ids[0]
criterion = torch.nn.CrossEntropyLoss()
inputs = Batch.inputs.cuda()
targets = Batch.targets.cuda()
assert inputs.size(0) == targets.size(0)  # passes: inputs and targets agree here
logits = model(inputs)
loss = criterion(logits, targets)  # raises ValueError on the last batch


Could you print the shapes of the tensors yielding this error?
To get rid of it for now, you could specify drop_last=True in your DataLoader.

My guess is that the last batch has a single sample that gets squeezed somewhere, so that a size mismatch occurs.
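To illustrate the drop_last workaround: a minimal sketch with a made-up dataset (the sizes and shapes below are assumptions chosen to mirror the thread, not the poster's actual data) showing that drop_last=True discards the trailing incomplete batch, so every batch the model sees has exactly batch_size samples.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data: 100 samples with the same per-sample shape as in the thread
inputs = torch.randn(100, 128, 512)
targets = torch.randint(0, 311, (100,))
dataset = TensorDataset(inputs, targets)

# drop_last=True: 100 // 12 = 8 full batches; the trailing 4 samples are dropped
loader = DataLoader(dataset, batch_size=12, shuffle=True, drop_last=True)
sizes = [x.size(0) for x, y in loader]
```

With drop_last=False (the default), the same loader would yield a ninth batch of 4 samples, which is the kind of short final batch that can trigger shape surprises.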

(James Jing Tang) #3

The shape of the targets tensor in the last yielded batch is (12,), and the shape of the inputs tensor is (12, 128, 512). But the shape of the logits tensor is (8, 311).
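For reference, DataParallel scatters the input along dim 0 across the listed devices and gathers the per-replica outputs back along dim 0, so a batch of 12 over 3 GPUs should come back as 12 rows of logits. A quick CPU-only sketch of that split arithmetic (no GPUs needed, just chunk semantics):

```python
import torch

# DataParallel splits the batch along dim 0; with 3 devices a batch of 12
# becomes three chunks of 4, and the gathered output should again be 12 rows.
batch = torch.randn(12, 128, 512)
chunks = batch.chunk(3, dim=0)  # what each of the 3 replicas would receive
chunk_sizes = [c.size(0) for c in chunks]  # [4, 4, 4]
```

So logits of shape (8, 311) mean 4 rows went missing inside the model rather than in the loader. Without seeing the forward method it is hard to say where, but one common culprit (a guess on my part, not confirmed by this thread) is a reshape that assumes a fixed global batch size instead of using the per-replica size, e.g. deriving shapes from a stored attribute rather than from x.size(0).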