Hi! I’m trying to set up multi-GPU training for my sort of complicated model, but I’m not sure if I’m pursuing the correct direction. I’ve been reading tutorials including this and this, but I just realized according to comments in the class, DDP is splitting the first dimension of input, while I have a dictionary as input. My model is like a multi-encoder-decoder framework over entities and relations, where we have a number of encoders for entities and a number of pair-wise decoders for relations, and the objective is link prediction. Each batch of input would be a dictionary like

```
{
'tokens':{
'entity_A': some_tensor, # size is [num_A, seq_len_A]
'entity_B': some_tensor, # size is [num_B, seq_len_B]
...
},
'pairs':{
'A_B': some_tensor, # size is [num_AB, 3], 3 = head, relation, tail
'A_C': some_tensor, # size is [num_AC, 3], 3 = head, relation, tail
...
}
}
```

I’m thinking of manually requiring DDP to only split the relations to each device, and all entities in the batch will be input to the `forward`

. Then, in the model, we only keep the entities that those relations encompass (including remapping the indices). I’m not sure if this is the best way though, and am not familiar with how to “require DDP to only split certain inputs to `forward`

” either.

I am not using `pytorch-geometric`

, but I am looking at their usage of DDP here. It looks like one can customize how data is fed into each machine in `DistributedSampler`

(?) and DDP isn’t directly truncating the multiple inputs in the same way. But I’m still a bit unsure how DDP would work in those cases. Any pointers & references will be greatly appreciated!