I am getting this replicas error today.
Setup:
Machine: HPC cluster, GPU queue, single node
PyTorch: 1.10.2
Number of GPU devices available: 2
Names of GPU devices: ['Tesla V100-PCIE-16GB', 'Tesla V100-PCIE-16GB']
Training worked fine with DDP on a single GPU, but with multiple GPUs it fails with:
RuntimeError: replicas[0][64] in this process with sizes [688, 64] appears not to match sizes of the same param in process 0.
I tried printing the parameter sizes on each rank:

for param in model.parameters():
    print("param ", param.size())
The parameter sizes matched on both processes except for the last four.

On the first GPU:
param torch.Size([688, 64])
param torch.Size([688])
param torch.Size([717, 64])
param torch.Size([688, 64])
On the second GPU:
param torch.Size([699, 64])
param torch.Size([699])
param torch.Size([730, 64])
param torch.Size([699, 64])
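To see which layers those tensors belong to, I think I could also compare the shapes across ranks by name with something like this (a rough sketch; it assumes the process group is already initialised):

import torch.distributed as dist

# Gather every rank's parameter shapes and report mismatches on rank 0
shapes = {name: tuple(p.size()) for name, p in model.named_parameters()}
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, shapes)
if dist.get_rank() == 0:
    for name in gathered[0]:
        sizes = [g[name] for g in gathered]
        if len(set(sizes)) > 1:
            print("mismatch in", name, ":", sizes)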
What can I do to make the model parameters identical on all devices? This is how I construct the model:
transformer = Seq2SeqTransformer(num_encoder_layers=2,
                                 num_decoder_layers=2,
                                 emb_size=64,
                                 nhead=2,
                                 src_vocab_size=SRC_VOCAB_SIZE,
                                 tgt_vocab_size=TGT_VOCAB_SIZE,
                                 dim_feedforward=64,
                                 dropout=0.2)
model = transformer.to(rank)
model = DDP(model, device_ids=[rank], find_unused_parameters=True)
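My guess is that each process builds its own vocabulary from the data, so SRC_VOCAB_SIZE and TGT_VOCAB_SIZE end up different per rank (688/717 vs 699/730 look like vocab sizes), and the embedding / output layers then get different shapes. Would something like the sketch below be the right fix, i.e. build the vocabs once on rank 0 and broadcast them before constructing the model? (build_vocabs, SRC_LANGUAGE and TGT_LANGUAGE are just placeholders for my own vocab-building code.)

import torch.distributed as dist

# Build the vocabularies only on rank 0 and broadcast them to the other
# ranks, so every process constructs the model with identical vocab sizes.
# build_vocabs, SRC_LANGUAGE and TGT_LANGUAGE are placeholders.
vocab_transform = build_vocabs(train_iter) if dist.get_rank() == 0 else None
objects = [vocab_transform]
dist.broadcast_object_list(objects, src=0)
vocab_transform = objects[0]

SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])

Or is it better to just build the vocabulary deterministically from the full training set on every rank?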
Thanks