Here is the situation: a customized DataLoader loads the train/val/test data. The model runs on a single GPU, but not on multiple GPUs.
```python
class EncoderDecoder(torch.nn.Module):
    def forward(self, feats, masks, ...):
        clip_masks = self.clip_feature(masks, feats)
        ...

    def clip_feature(self, masks, feats):
        '''Clips the input masks/features to the same padded length.'''
        max_len = masks.data.long().sum(1).max()
        print('max_len:%d' % max_len)
        masks = masks[:, :max_len].contiguous()
        ...
        return masks
    ...

def train(opt):
    model = EncoderDecoder(opt)
    # setting-1
    cuda_model = model.cuda().train()
    # setting-2
    # cuda_model = torch.nn.DataParallel(model.cuda())
    cuda_model.train()
    torch.cuda.synchronize()
    ...
```
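For reference, here is a standalone sketch of what the clipping logic does on a full batch (the shapes are taken from my debug output below; the sequence lengths are made up):

```python
import torch

# Standalone sketch of the clipping step with made-up lengths.
masks = torch.zeros(150, 61)
masks[0, :61] = 1                          # suppose one sequence spans all 61 steps
masks[1:, :50] = 1                         # and the rest are shorter
max_len = masks.data.long().sum(1).max()   # longest unpadded length in this batch
print('max_len:%d' % int(max_len))         # -> max_len:61
clipped = masks[:, :max_len].contiguous()  # -> torch.Size([150, 61])
```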
If I launch the model on a single GPU, as marked "setting-1", it works but takes days. The tensors returned by clip_feature are as expected. The debug output is as follows:
```
masks.shape (150, 61)
EncoderDecoder clip_feature masks.shape in (150, 61) masks.device:cuda:0
max_len:61
masks.shape clip_att (150, 61)
max_len:61
masks.size (150, 61)
att_mask.device cuda:0
```
When I use DataParallel instead of a single GPU, indicated as "setting-2", the results change somehow:
```
EncoderDecoder clip_feature masks.shape in (38, 61) masks.device:cuda:0
masks.shape (38, 61)
EncoderDecoder clip_feature masks.shape in (38, 61) masks.device:cuda:1
masks.shape (38, 61)
RelationTransformer clip_feature att_masks.shape in (38, 61) masks.device:cuda:2
max_len:50
max_len:50
```
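To see what each replica actually receives, a probe along these lines can be used (a minimal sketch, assuming 4 visible GPUs; `Probe` is a hypothetical stand-in for the model):

```python
import torch

class Probe(torch.nn.Module):
    def forward(self, masks):
        # Each replica computes max_len from its own slice of the batch.
        max_len = masks.long().sum(1).max()
        print(masks.shape, masks.device, 'max_len:%d' % int(max_len))
        return masks  # returned unclipped so the gather step succeeds

probe = torch.nn.DataParallel(Probe().cuda())
masks = torch.zeros(150, 61).cuda()
masks[:38, :61] = 1   # longer sequences land in the first replica's slice
masks[38:, :50] = 1   # shorter ones elsewhere
out = probe(masks)    # prints ~38 rows per device, with differing max_len values
```

If the replicas print different max_len values, the clipped widths will disagree across devices.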
Later it raises a runtime error at a multiplication I intend to perform:
```
RuntimeError: The size of tensor a (61) must match the size of tensor b (60) at non-singleton dimension 3
```
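For context, the same error can be reproduced standalone when a mask clipped to a smaller width meets a tensor padded to the full length (the shapes below are assumptions based on the error message, not my actual tensors):

```python
import torch

# Hypothetical shapes reconstructing the failing multiplication:
scores = torch.randn(38, 8, 61, 61)  # e.g. attention scores padded to the full length
masks  = torch.ones(38, 1, 1, 60)    # a mask clipped to a smaller local max_len of 60
scores * masks  # RuntimeError: ... tensor a (61) must match ... tensor b (60) at dimension 3
```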
I have no idea how this happens. The batched input is dispatched to different devices, but the results are completely different from those returned by a single GPU. I do not think it depends on the parallel dispatching across GPUs; maybe I missed some configuration for my model. The running environment is as follows (I tested with different torch versions):
- torch 0.4.1 / 1.4.0+cu100
- torchvision 0.2.1 / 0.5.0+cu100
- 4 x Tesla V100-SXM2 Driver Version: 410.104 CUDA Version: 10.0
Any input to help me out would be appreciated. Thanks.