I built a network using customized layers. It runs fine on a single GPU but crashes when using two GPUs on a server. The code and error message are shown below. It seems that one of the tensors was split across the two GPUs while the other was not. Was this caused by the customized forward function? How should I solve it? Thanks!
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

class Cov1(nn.Linear):  # parent class inferred from the super() signature
    def __init__(self, in_dim=Fdim, out_dim=Fdim, bias=True):
        super(Cov1, self).__init__(in_dim, out_dim, bias)

    def forward(self, seq):
        simcov1 = torch.zeros(seq.shape).cuda()
        for i in range(self.in_dim):
            SeqDist = Vsets(seq[:, i].unsqueeze(1))
            simcov1[:, i] = torch.mean(SeqDist * sum_idx, 1)
        simcov1 = 1 - simcov1
        if self.bias is not None:
            mean_dist = simcov1.matmul(self.weight) + self.bias
        else:  # without the else, this line would overwrite the biased result
            mean_dist = simcov1.matmul(self.weight)
        return mean_dist
If you are using DataParallel, the assumption is that all input tensors have the same size along the first (batch) dimension. Otherwise the splitting behavior becomes tricky to reason about.
What are the input shapes (and the meaning of each dimension) being passed, and is DataParallel being used?
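To illustrate what that splitting looks like, here is a minimal CPU sketch that mimics DataParallel's scatter step with torch.chunk (the 446-element shapes are taken from this thread; the actual scatter happens inside nn.DataParallel):

```python
import torch

# nn.DataParallel scatters every input tensor along dim 0, roughly like
# torch.chunk(t, num_gpus, dim=0) applied to each input independently.
SeqDist = torch.randn(446, 446)
sum_idx = torch.randn(446)

a0, a1 = torch.chunk(SeqDist, 2, dim=0)   # each chunk: (223, 446)
b0, b1 = torch.chunk(sum_idx, 2, dim=0)   # each chunk: (223,)

# On one device, (446, 446) * (446,) broadcasts over rows and works.
ok = SeqDist * sum_idx                    # shape (446, 446)

# After the split, (223, 446) * (223,) cannot broadcast: the trailing
# dimensions (446 vs. 223) no longer match, so the replica crashes.
try:
    a0 * b0
except RuntimeError as e:
    print("broadcast failed:", e)
```

This is the failure mode when one tensor is split per GPU while its partner keeps (or is given) a mismatched first dimension.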
Thanks eqy. Yes, DataParallel is used for the model: model = nn.DataParallel(model).cuda(). The dimension of the first tensor SeqDist is 446x446 and the second tensor sum_idx is 446. The multiplication SeqDist * sum_idx selects the rows specified in sum_idx and calculates each row's average.
If the first dimension of the tensors changes, I don't know how to make the multiplication work…
In this case, can you simply make this data-parallel by doing something like making SeqDist (N, 446, 446) and sum_idx (N, 446)?
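A minimal sketch of that reshaping, assuming a hypothetical batch size N: with a leading batch dimension on both tensors, the row-average becomes a batched broadcast and both inputs now split consistently along dim 0.

```python
import torch

N = 4  # example batch size; DataParallel splits along this dimension
seqdist = torch.randn(N, 446, 446)
sum_idx = torch.randn(N, 446)

# Broadcast sum_idx over the row dimension and average each row:
# the batched analogue of torch.mean(SeqDist * sum_idx, 1).
row_mean = torch.mean(seqdist * sum_idx.unsqueeze(1), dim=2)
print(row_mean.shape)  # (N, 446)

# Both tensors now chunk consistently along the first dimension,
# so each replica receives matching pieces.
s0, s1 = torch.chunk(seqdist, 2, dim=0)   # each (N/2, 446, 446)
i0, i1 = torch.chunk(sum_idx, 2, dim=0)   # each (N/2, 446)
```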
It works~ Thanks a lot~ I revised my code to match the splitting mechanism.