A similar question on Stack Overflow is here, but none of the answers helped.
As this tutorial shows, the outputs from multiple GPUs are concatenated along dimension 0, but I don’t know why that does not work in my code.
model = T2T(......) # T2T is a sub class of nn.Module
if torch.cuda.device_count() > 1:
    print("Using", torch.cuda.device_count(), "GPUs!")
    model = torch.nn.DataParallel(model)
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
train_loader = DataLoader(......)
epoch_steps = len(train_loader)
STEPS = train_epochs * epoch_steps
train_loader = cycle(train_loader)
for step in range(STEPS):
    batch_data = next(train_loader)
    labels = batch_data['labels'].to(device)
    probs = model(......)
    # calculate loss using labels and probs
When calculating the loss, I got an error: the batch sizes of labels and probs are not the same. The shape of labels is [64,…] and the shape of probs is [32,…]. I am using 2 GPUs, so I guess the outputs from the GPUs are not being concatenated.
Any idea how to fix this? Thanks in advance!
What is the shape of the input? The nn.DataParallel model would split the input in dim0 and concatenate it afterwards. If the input has a batch size of 32, the output would have the same shape.
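To make the split-and-concatenate behaviour concrete, here is a minimal CPU-only sketch. It does not use nn.DataParallel itself; it only imitates the idea with torch.chunk and torch.cat. The device count, the tensor sizes, and the `chunk * 2` stand-in for the forward pass are all assumptions made for illustration.

```python
import torch

num_gpus = 2  # assumed device count, matching the setup in this thread
inputs = torch.randn(32, 10)  # [batch_size, features]; sizes are made up

# Split along dim0: roughly what each replica would receive.
chunks = torch.chunk(inputs, num_gpus, dim=0)

# Stand-in for the per-GPU forward pass (not the real model).
outputs = [chunk * 2 for chunk in chunks]

# Concatenate the per-replica outputs back along dim0.
gathered = torch.cat(outputs, dim=0)

print([tuple(c.shape) for c in chunks])  # [(16, 10), (16, 10)]
print(tuple(gathered.shape))  # (32, 10)
```

So a batch of 32 goes in, each of the two replicas sees 16 samples, and the gathered output has a batch size of 32 again.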
probs = model(inputrulelist, syn_inputrulelist, tree_path_vec, rule_mask, syn_rule_mask, inputrulelistnode, syn_inputrulelistnode, inputrulelistson, syn_inputrulelistson, sequence_mask, treemask, syn_treemask, path_lens)
Sorry for the late reply. I have already fixed the problem, but the actual reason remains unknown. The model takes many inputs, as shown above; the first dimension of each of them is batch_size, which is currently 64. The one exception is the sequence_mask tensor: it has shape [1, n, n], and the elements on and below the main diagonal are all zeros (it is used for sequence masking in multi-head attention). I set its batch_size to 1 because I had noticed that the DataParallel model splits the input in dim0, and I hoped all devices would get the same mask.
After I changed the first dimension of sequence_mask to batch_size (64 for now), the problem was fixed. Specifically, I now use tensor.repeat(batch_size, 1, 1) instead of tensor.unsqueeze(0) to generate the batch_size dimension. After that, the first dimension of the output of the DataParallel model (probs in the code) became 64 instead of 32. I don’t know the exact reason because I am not familiar with DataParallel, so it would be very helpful if you could explain it. Thanks a lot.
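For anyone landing here later, this is roughly what the change looks like. It is only a sketch: the value of n and the names base_mask, mask_single, and mask_batched are made up for illustration, and building the mask with torch.triu is an assumption based on the description above (zeros on and below the main diagonal).

```python
import torch

batch_size, n = 64, 5  # n is an arbitrary sequence length for illustration

# Zeros on and below the main diagonal, ones above it.
base_mask = torch.triu(torch.ones(n, n), diagonal=1)

# Before: shape [1, n, n]. dim0 has a single entry, so it cannot be split
# into one chunk per GPU.
mask_single = base_mask.unsqueeze(0)

# After: shape [batch_size, n, n]. One copy of the mask per sample, so dim0
# matches the other inputs and can be split the same way.
mask_batched = base_mask.repeat(batch_size, 1, 1)

print(tuple(mask_single.shape))   # (1, 5, 5)
print(tuple(mask_batched.shape))  # (64, 5, 5)
```

The repeated version costs batch_size times the memory of the single mask, which is usually negligible for an [n, n] mask.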
Your description is right. nn.DataParallel expects all input tensors in the shape [batch_size, *] and will split them in dim0. If you’ve applied broadcasting in the past, repeating the tensor sounds reasonable.
But I still don’t understand why I only get the output from one GPU (instead of the concatenation of the outputs from all GPUs) if I pass a tensor with shape [1, *].
This tensor won’t be split, as dim0 has only a single sample and thus only a single GPU will execute this tensor.
If you want to make sure each GPU gets a chunk of the input tensor, make sure dim0 has at least a size of the number of GPUs.
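A small pure-Python simulation may help show why the output ends up with a batch size of 32. As far as I understand the scatter step, each positional argument is chunked along dim0 on its own and the per-argument chunks are then zipped together, so the argument with the fewest chunks decides how many replicas run. The helpers chunk_sizes and scatter_batch_sizes below are hypothetical names that only track batch sizes as integers; they are not part of PyTorch.

```python
import math


def chunk_sizes(batch, num_chunks):
    """Sizes obtained when `batch` items are split the way torch.chunk does."""
    size = math.ceil(batch / num_chunks)
    sizes = []
    remaining = batch
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


def scatter_batch_sizes(arg_batch_sizes, num_gpus):
    """Per-replica dim0 sizes, one tuple entry per positional argument."""
    per_arg = [chunk_sizes(b, num_gpus) for b in arg_batch_sizes]
    # zip stops at the shortest list, so an argument with a single chunk
    # means only a single replica receives inputs.
    return list(zip(*per_arg))


# Main input with batch size 64 plus a mask with batch size 1, on 2 GPUs.
mixed = scatter_batch_sizes([64, 1], num_gpus=2)
# Same inputs after repeating the mask to batch size 64.
fixed = scatter_batch_sizes([64, 64], num_gpus=2)

print(mixed)  # [(32, 1)]
print(fixed)  # [(32, 32), (32, 32)]
```

In the first case only one replica runs, and it sees 32 samples, which matches the [32,…] output reported above. In the second case both replicas run and the gathered output has 64 samples.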
Oh I get it, thank you for your time!