Here is the situation: a customized DataLoader loads the train/val/test data. The model runs on a single GPU, but not on multiple GPUs.
```python
class EncoderDecoder(torch.nn.Module):
    def forward(self, feats, masks, ...):
        clip_masks = self.clip_feature(masks, feats)
        ...

    def clip_feature(self, masks, feats):
        '''Clips the input masks/features to the same padded length.'''
        max_len = masks.data.long().sum(1).max()
        print('max_len:%d' % max_len)
        masks = masks[:, :max_len].contiguous()
        ...
        return masks
    ...

def train(opt):
    model = EncoderDecoder(opt)
    # setting-1
    cuda_model = model.cuda().train()
    # setting-2
    # cuda_model = torch.nn.DataParallel(model.cuda())
    cuda_model.train()
    torch.cuda.synchronize()
    ...
```
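For reference, here is a standalone sketch of what the clipping logic does on a full batch (the shapes are taken from my debug output below; the sequence lengths are made up):

```python
import torch

# Standalone sketch of the clipping step with made-up lengths.
masks = torch.zeros(150, 61)
masks[0, :61] = 1                          # suppose one sequence spans all 61 steps
masks[1:, :50] = 1                         # and the rest are shorter
max_len = masks.data.long().sum(1).max()   # longest unpadded length in this batch
print('max_len:%d' % int(max_len))         # -> max_len:61
clipped = masks[:, :max_len].contiguous()  # -> torch.Size([150, 61])
```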
If I launch the model on a single GPU, as marked "setting-1", it works but takes days. The tensors returned by clip_feature are as expected. The debug output is as follows:
```
masks.shape (150, 61)
EncoderDecoder clip_feature masks.shape in (150, 61) masks.device:cuda:0
max_len:61
masks.shape clip_att (150, 61)
max_len:61
masks.size (150, 61)
att_mask.device cuda:0
```
When I use DataParallel instead of a single GPU, indicated as "setting-2", the results change somehow:
```
EncoderDecoder clip_feature masks.shape in (38, 61) masks.device:cuda:0
masks.shape (38, 61)
EncoderDecoder clip_feature masks.shape in (38, 61) masks.device:cuda:1
masks.shape (38, 61)
RelationTransformer clip_feature att_masks.shape in (38, 61) masks.device:cuda:2
max_len:50
max_len:50
```
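To see what each replica actually receives, a probe along these lines can be used (a minimal sketch, assuming 4 visible GPUs; `Probe` is a hypothetical stand-in for the model):

```python
import torch

class Probe(torch.nn.Module):
    def forward(self, masks):
        # Each replica computes max_len from its own slice of the batch.
        max_len = masks.long().sum(1).max()
        print(masks.shape, masks.device, 'max_len:%d' % int(max_len))
        return masks  # returned unclipped so the gather step succeeds

probe = torch.nn.DataParallel(Probe().cuda())
masks = torch.zeros(150, 61).cuda()
masks[:38, :61] = 1   # longer sequences land in the first replica's slice
masks[38:, :50] = 1   # shorter ones elsewhere
out = probe(masks)    # prints ~38 rows per device, with differing max_len values
```

If the replicas print different max_len values, the clipped widths will disagree across devices.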
Later it raises a runtime error at a multiplication I intend to perform:
```
RuntimeError: The size of tensor a (61) must match the size of tensor b (60) at non-singleton dimension 3
```
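For context, the same error can be reproduced standalone when a mask clipped to a smaller width meets a tensor padded to the full length (the shapes below are assumptions based on the error message, not my actual tensors):

```python
import torch

# Hypothetical shapes reconstructing the failing multiplication:
scores = torch.randn(38, 8, 61, 61)  # e.g. attention scores padded to the full length
masks  = torch.ones(38, 1, 1, 60)    # a mask clipped to a smaller local max_len of 60
scores * masks  # RuntimeError: ... tensor a (61) must match ... tensor b (60) at dimension 3
```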
I have no idea how this happens. The batched input is dispatched to different devices, but the results are completely different from those returned by a single GPU. I do not think it depends on the parallel dispatching across GPUs; maybe I missed some configuration for my model. The running environment is as follows (I tested with different torch versions):
- torch 0.4.1 / 1.4.0+cu100
- torchvision 0.2.1 / 0.5.0+cu100
- 4 x Tesla V100-SXM2 Driver Version: 410.104 CUDA Version: 10.0
Any input to help me out would be appreciated. Thanks.