I get wrong results only in the last epoch. Here are some logs:
log1:
Epoch 97 train loss 0.892 acc: 77.805%; val loss: 0.930 acc 77.010%; use 8.296 mins.
Epoch 98 train loss 0.887 acc: 77.922%; val loss: 0.931 acc 77.024%; use 8.305 mins.
Epoch 99 train loss 0.422 acc: 38.989%; val loss: 0.459 acc 38.506%; use 8.300 mins.
All metrics are 4/8 of the expected values. It seems that the contributions from 4 of the 8 GPUs were 0.
log2:
Epoch 96 train loss 0.973 acc: 75.933%; val loss: 0.967 acc 76.188%; use 9.449 mins.
Epoch 97 train loss 0.969 acc: 76.003%; val loss: 0.967 acc 76.148%; use 9.459 mins.
Epoch 98 train loss 0.969 acc: 76.029%; val loss: 0.967 acc 76.228%; use 9.445 mins.
Epoch 99 train loss 1.333 acc: 104.523%; val loss: 1.326 acc 104.876%; use 9.452 mins.
All metrics are 11/8 of the expected values: 1.333 / (11/8) = 0.969. It seems that the results from 3 GPUs are counted twice in the all_reduce. These strange results only happen in the final epoch. What could be the possible reasons?
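The 4/8 and 11/8 ratios are exactly what you get if the usual sum-then-divide aggregation runs with some ranks missing or double-counted. A minimal arithmetic sketch (plain Python, no actual `dist.all_reduce`; the averaging-by-world-size convention is an assumption based on the ratios above):

```python
# Simulate an 8-GPU all_reduce(SUM) followed by division by world_size.
# Each "rank" contributes its local metric; the result should be the
# global mean. If some contributions are dropped or duplicated, the
# average is scaled by exactly the missing/extra fraction.

WORLD_SIZE = 8

def all_reduce_mean(contributions, world_size=WORLD_SIZE):
    """Sum the per-rank values and divide by world_size, as
    dist.all_reduce(t, op=ReduceOp.SUM); t /= world_size would."""
    return sum(contributions) / world_size

# Healthy case: every rank reports the same local loss, say 0.930.
healthy = all_reduce_mean([0.930] * 8)                       # 0.930

# log1: 4 of 8 ranks contribute 0 -> result is 4/8 of expected.
four_dropped = all_reduce_mean([0.930] * 4 + [0.0] * 4)      # 0.465

# log2: 3 ranks counted twice -> result is 11/8 of expected.
three_doubled = all_reduce_mean([0.967] * 8 + [0.967] * 3)   # 0.967 * 11/8
```

This reproduces both logs: `four_dropped` is half the healthy value, and `three_doubled` is 11/8 of it (0.967 × 11/8 ≈ 1.330, matching the 1.326–1.333 seen in epoch 99 of log2).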
Hey @KaiHoo, can you print the reduce_tensor before you pass it to all_reduce, so that we can narrow down whether it is the all_reduce or the DDP training/testing that's misbehaving?
@iffiX @mrshenli Hello, sorry for the late reply. Here is part of the log output:
Epoch 95 train loss 1.056 acc: 74.176%; val loss: 0.954 acc 75.958%; use 12.457 mins.
Epoch 96 train loss 1.048 acc: 74.339%; val loss: 0.949 acc 75.998%; use 12.459 mins.
Epoch 97 train loss 1.028 acc: 74.815%; val loss: 0.946 acc 76.232%; use 12.455 mins.
Epoch 98 train loss 1.027 acc: 74.855%; val loss: 0.946 acc 76.236%; use 12.475 mins.
Rank 5 train loss 1.026 acc: 74.890%; val loss: 0.972 acc 75.600%
Rank 6 train loss 1.025 acc: 74.815%; val loss: 0.924 acc 76.352%
Rank 3 train loss 1.025 acc: 74.889%; val loss: 0.957 acc 75.632%
Rank 1 train loss 1.032 acc: 74.757%; val loss: 0.929 acc 76.960%
Rank 7 train loss 1.023 acc: 75.038%; val loss: 0.930 acc 76.512%
Rank 2 train loss 1.019 acc: 75.013%; val loss: 0.958 acc 76.144%
Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x7ffb5070eea0>
Traceback (most recent call last):
File "/usr/lib/python3.5/weakref.py", line 117, in remove
TypeError: 'NoneType' object is not callable
Rank 0 train loss 1.022 acc: 74.841%; val loss: 0.951 acc 76.304%
Epoch 99 train loss 0.513 acc: 37.451%; val loss: 0.470 acc 38.097%; use 12.467 mins.
Rank 4 train loss 1.023 acc: 74.984%; val loss: 0.947 acc 76.304%
Since the strange results only happen in the final epoch, I only printed the metrics for the last epoch. The order of the logs is exactly what I got, though the 'Rank 4' line should have been printed before the 'Epoch 99 train' line: