A similar question on Stack Overflow is here, but none of the answers are helpful.
As this tutorial shows, the outputs from multiple GPUs should be concatenated along dimension 0, but I don't know why this doesn't work in my code.
import torch
from itertools import cycle
from torch.utils.data import DataLoader

model = T2T(......)  # T2T is a subclass of nn.Module
if torch.cuda.device_count() > 1:
    print("Using", torch.cuda.device_count(), "GPUs!")
    model = torch.nn.DataParallel(model)
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
train_loader = DataLoader(......)
......
epoch_steps = len(train_loader)
STEPS = train_epochs * epoch_steps
train_loader = cycle(train_loader)  # endless iterator over batches
......
for step in range(STEPS):
    batch_data = next(train_loader)
    optimizer.zero_grad()
    labels = batch_data['labels'].to(device)
    ......
    probs = model(......)
    # calculate loss using labels and probs
When calculating the loss, I get an error: the batch sizes of labels and probs are not the same. The shape of labels is [64,…] and the shape of probs is [32,…]. I am using 2 GPUs, so I guess the outputs from the GPUs are not being concatenated.
What is the shape of the input? The nn.DataParallel model splits the input in dim0 and concatenates the outputs afterwards. If the input has a batch size of 32, the output will have the same shape.
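For illustration, here is a minimal sketch (the Toy module and the sizes are made up, not your T2T) showing that nn.DataParallel splits a [batch_size, *] input along dim0 and gathers the per-GPU outputs back along dim0, so dim0 of the output matches dim0 of the input:

import torch
import torch.nn as nn

class Toy(nn.Module):  # hypothetical stand-in for the real model
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 4)

    def forward(self, x):
        # on each replica, x is a chunk of shape [batch_size / num_gpus, 8]
        return self.linear(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.DataParallel(Toy()).to(device)

x = torch.randn(64, 8, device=device)  # batch_size = 64 in dim0
probs = model(x)
print(probs.shape)  # torch.Size([64, 4]): chunks are gathered back on dim0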
Sorry for the late reply; I have already fixed the problem, but the actual reason remains unknown. The model takes many input tensors, as shown above, and the first dimension of each of them is batch_size, which is currently 64.
One exception is the sequence_mask tensor: it has shape [1, n, n] and its elements on and below the main diagonal are all zeros (it is used for sequence masking in multi-head attention). Its batch dimension is set to 1 because I had noticed that the DataParallel model splits inputs in dim0, and I wanted all devices to receive the same sequence_mask.
After I changed the first dimension of sequence_mask to batch_size (64 for now), the problem was fixed. Specifically, I use tensor.repeat(batch_size, 1, 1) instead of tensor.unsqueeze(0) to create the batch dimension. After that, the first dimension of the DataParallel model's output (probs in the code) becomes 64 instead of 32. I don't know the exact reason due to my unfamiliarity with DataParallel; it would be very helpful if you could explain it, thanks a lot.
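For clarity, here is a rough sketch of the change described above (n and batch_size are placeholder values; the mask layout follows the description: zeros on and below the main diagonal):

import torch

n, batch_size = 5, 64  # placeholder sizes

# base causal mask: zeros on and below the main diagonal, ones above it
base_mask = torch.triu(torch.ones(n, n), diagonal=1)

# before: shape [1, n, n], so DataParallel cannot split it along dim0
sequence_mask = base_mask.unsqueeze(0)

# after: shape [batch_size, n, n], so every GPU receives its own chunk
sequence_mask = base_mask.repeat(batch_size, 1, 1)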
Your description is right. nn.DataParallel expects all input tensors in the shape [batch_size, *] and will split them in dim0. If you’ve applied broadcasting in the past, repeating the tensor sounds reasonable.
But I still don't understand why I only get the output from one GPU (instead of the concatenation of outputs from all GPUs) if I pass a tensor with shape [1, *].
This tensor won't be split, as dim0 has only a single sample, so only a single GPU will process this tensor.
If you want to make sure each GPU gets a chunk of the input tensor, make sure dim0 has a size of at least num_gpus.
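As a rough illustration (using tensor.chunk to approximate the scatter step; the internals of DataParallel differ slightly), a size-1 dim0 produces only a single chunk, so only one replica receives any data:

import torch

num_gpus = 2  # assume 2 devices

small = torch.randn(1, 5, 5)   # dim0 == 1
big = torch.randn(64, 5, 5)    # dim0 == batch_size

# splitting along dim0 into at most num_gpus chunks
print([c.shape for c in small.chunk(num_gpus, dim=0)])  # [torch.Size([1, 5, 5])], one chunk, one GPU
print([c.shape for c in big.chunk(num_gpus, dim=0)])    # [torch.Size([32, 5, 5]), torch.Size([32, 5, 5])]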