I have one PC with two GPUs. Below is a schematic of my distributed training code:
```python
dist.init_process_group(backend="nccl", ...)
...
model = DistributedDataParallel(model, device_ids=[rank])  # rank is 0 or 1
...
for i in range(len(batches)):
    ...
    outputs = model(inputs, ...)
    loss = criterion(outputs, ...)
    loss.backward()
    optimizer.step()
    print(rank, "-", i)
    ...
```
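For context, here is a minimal self-contained version of the same loop that I could run on my two-GPU machine. The model, the random data, the learning rate and the `mp.spawn` launcher with `MASTER_ADDR`/`MASTER_PORT` are placeholders I made up for this post, not my real code:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


def run(rank, world_size):
    # One process per GPU; rank is 0 or 1 on my machine.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Toy model and random data, just to make the sketch runnable.
    model = nn.Linear(10, 1).cuda(rank)
    model = DistributedDataParallel(model, device_ids=[rank])
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    batches = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(100)]
    for i in range(len(batches)):
        inputs, targets = batches[i]
        inputs, targets = inputs.cuda(rank), targets.cuda(rank)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()   # gradients are all-reduced (averaged) across ranks here
        optimizer.step()
        print(rank, "-", i)

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size)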
I expected to see the following print output (to save space, I list it horizontally):
0-0 1-0, 0-1 1-1, 0-2 1-2, 0-3 1-3, ..., 0-n 1-n
However, I got something unexpected:
0-0 1-0, 0-1 1-1, ..., 0-60 1-62, ..., 0-3125 1-3145, ...
Sometimes gpu_1 is up to 20 iterations ahead of gpu_0!
I see two possibilities here:
- my code has an error somewhere
- this is normal synchronization behavior
If the second is the case, could you please explain why it is so?
From what I’ve understood from the documentation, synchronization between the processes happens at every iteration, inside the loss.backward() call. Suppose I have some model parameter w. At each iteration this parameter must be the same in the gpu_0 and gpu_1 model replicas.
At the first iteration w becomes:
w + (upd0_0 + upd1_0)/2
at the second iteration:
w + (upd0_0 + upd1_0)/2 + (upd0_1 + upd1_1)/2
and at the i-th iteration:
w + (upd0_0 + upd1_0)/2 + (upd0_1 + upd1_1)/2 + ... + (upd0_i + upd1_i)/2
where upd0_i and upd1_i are the parameter updates calculated during backprop at the i-th iteration on gpu_0 and gpu_1 respectively.
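If that reasoning is right, the replicas should hold identical weights after every optimizer.step(), regardless of what the prints show. To convince myself, I was thinking of adding a check like the one below right after optimizer.step(); the helper name check_param_sync and the choice of comparing only the first parameter are mine, just for illustration:

```python
import torch
import torch.distributed as dist


def check_param_sync(model, rank, world_size):
    # Gather one parameter tensor from every rank and compare them on rank 0.
    # Hypothetical diagnostic helper, meant to be called after optimizer.step().
    param = next(model.parameters()).detach()
    gathered = [torch.empty_like(param) for _ in range(world_size)]
    dist.all_gather(gathered, param)
    if rank == 0:
        for r in range(1, world_size):
            if not torch.allclose(gathered[0], gathered[r]):
                print(f"replica {r} diverged from replica 0")
```

If this never reports a divergence, the replicas stay in sync and only the order of the printed iteration counters differs between the two processes.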