DDP RuntimeError: replicas[0][64] in this process with sizes [688, 64] appears not to match sizes of the same param in process 0

I am getting this replicas error today.

Machine: HPC cluster, GPU queue, single node.
PyTorch 1.10.2
Number of GPU devices available: 2
Names of GPU devices: ['Tesla V100-PCIE-16GB', 'Tesla V100-PCIE-16GB']
Training works fine with DDP on 1 GPU,
but fails with this error on multiple GPUs:

RuntimeError: replicas[0][64] in this process with sizes [688, 64] appears not to match sizes of the same param in process 0.

I tried this:

for param in model.parameters():
    print("param ", param.size())

The parameter sizes match on both GPUs except for the last four.
First GPU:

param  torch.Size([688, 64])
param  torch.Size([688])
param  torch.Size([717, 64])
param  torch.Size([688, 64])

Second GPU:

param  torch.Size([699, 64])
param  torch.Size([699])
param  torch.Size([730, 64])
param  torch.Size([699, 64])

What can I do to my model to make the model parameters equal on all devices?

transformer = Seq2SeqTransformer(num_encoder_layers=2,
                                 num_decoder_layers=2,
                                 emb_size=64,
                                 nhead=2,
                                 tgt_vocab_size=TGT_VOCAB_SIZE,
                                 dropout=0.2)

model = transformer.to(rank)
model = DDP(model, device_ids=[rank], find_unused_parameters=True)


DDP should fall back to a standard single GPU use case if only one device is detected.

Could you check where these parameters are defined and speculate why they would change their shape?
Are you initializing some parameters lazily during the forward pass?

Thanks, I checked, and the following variables have different values on each GPU:

SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])

On the first GPU


On the second GPU


This happens because my script runs once per GPU, which means all my preprocessing steps and import statements run 1 + (number of GPUs) times.

When I replace these variables with constant values, training works fine. I would like to know why my preprocessing steps run multiple times. Is there something I am doing wrong?

Each process runs the script you launch. Assuming your data loading is in the main method, each process is expected to load and process its own Dataset. This is also why a DistributedSampler (torch.utils.data.distributed.DistributedSampler) is used: it avoids sample duplication across processes.
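To illustrate how the sampler splits the data, here is a minimal sketch. ToyDataset and indices_for_rank are hypothetical names used only for this example. num_replicas and rank are passed explicitly so the snippet runs without an initialized process group; in a real DDP run they would normally be taken from the process group.

```python
from torch.utils.data import Dataset
from torch.utils.data.distributed import DistributedSampler


class ToyDataset(Dataset):
    """Hypothetical stand-in for the real dataset: item i is just i."""

    def __init__(self, n):
        self.items = list(range(n))

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]


def indices_for_rank(dataset, rank, world_size):
    # Passing num_replicas and rank explicitly avoids the need for
    # torch.distributed.init_process_group in this illustration.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=False
    )
    return list(sampler)


dataset = ToyDataset(10)
shard0 = indices_for_rank(dataset, rank=0, world_size=2)
shard1 = indices_for_rank(dataset, rank=1, world_size=2)
print("rank 0 indices:", shard0)
print("rank 1 indices:", shard1)
```

Every process holds the same full dataset object, and only the indices it iterates over differ. With shuffle=True you would also call sampler.set_epoch(epoch) at the start of each epoch so that the shuffling changes between epochs.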
Make sure that each rank sees the full vocab so that the models are initialized in the same way.
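As a rough sketch of that advice: the thread does not show how vocab_transform is built, so the following assumes (hypothetically) that the sizes diverge because each process builds its vocabulary from a different subset of the data. build_vocab, SPECIALS and corpus are made-up names for illustration, not part of the original script.

```python
from collections import Counter

# Hypothetical special tokens, placed first in the vocabulary.
SPECIALS = ["<unk>", "<pad>", "<bos>", "<eos>"]


def build_vocab(sentences, specials=SPECIALS):
    """Build a token -> index mapping with a deterministic order."""
    counts = Counter(tok for s in sentences for tok in s.split())
    # Sorting makes the mapping identical in every process that
    # sees the same sentences.
    tokens = sorted(counts)
    itos = list(specials) + [t for t in tokens if t not in specials]
    return {tok: i for i, tok in enumerate(itos)}


corpus = ["the cat sat", "the dog ran", "a bird flew", "the fish swam"]
world_size = 2

# Problematic: each rank builds its vocab from its own slice of the data,
# so the vocab size (and the embedding shape) can differ per rank.
shard_sizes = [len(build_vocab(corpus[rank::world_size]))
               for rank in range(world_size)]

# Safer: every rank builds the vocab from the full corpus.
full_sizes = [len(build_vocab(corpus)) for rank in range(world_size)]

print("vocab sizes from shards:     ", shard_sizes)
print("vocab sizes from full corpus:", full_sizes)
```

Another option is to build the vocabulary once, save it to disk, and load the same file in every process, so that SRC_VOCAB_SIZE and TGT_VOCAB_SIZE cannot diverge.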