Hello, I use the DistributedDataParallel (DDP) module to train on ImageNet. To collect training metrics from the different GPUs, I use `torch.distributed.all_reduce`. Here is the relevant code:

```python
local_rank = args.local_rank
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)
for epoch in range(args.num_epoch + args.warmup_epoch):
    start = time.time()
    train_loss, train_acc = utils.train_one_epoch(net, train_loader, criterion, optimizer, scheduler)
    val_loss, val_acc = utils.test_one_epoch(net, val_loader, criterion)
    # train_loss, train_acc, val_loss, val_acc are Python floats
    reduce_tensor = torch.tensor([train_loss, train_acc, val_loss, val_acc]).to(device)
    torch.distributed.all_reduce(reduce_tensor)  # default op is SUM
    reduce_tensor /= args.num_gpus  # args.num_gpus = 8
    time_used = (time.time() - start) / 60.
    if local_rank == 0:
        print('Epoch %d train loss %.3f acc: %.3f%%; val loss: %.3f acc %.3f%%; use %.3f mins.' %
              (epoch, reduce_tensor[0], reduce_tensor[1], reduce_tensor[2], reduce_tensor[3], time_used))
```
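For clarity, this is what I expect the sum-then-divide to compute. Below is a toy single-process model of it (pure Python, with hypothetical per-rank values; no `torch.distributed` involved):

```python
# Toy model of the metric averaging above: all_reduce with the default
# SUM op adds the per-rank tensors elementwise; dividing by the number
# of GPUs then yields the mean across ranks.
num_gpus = 8

# Hypothetical per-rank [train_loss, train_acc] values, one list per GPU.
per_rank = [[0.90 + 0.01 * r, 77.0 + 0.1 * r] for r in range(num_gpus)]

# all_reduce(SUM): elementwise sum over all ranks.
reduced = [sum(vals) for vals in zip(*per_rank)]

# Divide by world size to get the cross-GPU average.
averaged = [v / num_gpus for v in reduced]
print(averaged)
```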

I get wrong results only in the last epoch. Here are some logs:

log1:

```
Epoch 97 train loss 0.892 acc: 77.805%; val loss: 0.930 acc 77.010%; use 8.296 mins.
Epoch 98 train loss 0.887 acc: 77.922%; val loss: 0.931 acc 77.024%; use 8.305 mins.
Epoch 99 train loss 0.422 acc: 38.989%; val loss: 0.459 acc 38.506%; use 8.300 mins.
```

All metrics in epoch 99 are about 4/8 of the expected values. It looks as if the contributions from 4 of the 8 GPUs were 0.

log2:

```
Epoch 96 train loss 0.973 acc: 75.933%; val loss: 0.967 acc 76.188%; use 9.449 mins.
Epoch 97 train loss 0.969 acc: 76.003%; val loss: 0.967 acc 76.148%; use 9.459 mins.
Epoch 98 train loss 0.969 acc: 76.029%; val loss: 0.967 acc 76.228%; use 9.445 mins.
Epoch 99 train loss 1.333 acc: 104.523%; val loss: 1.326 acc 104.876%; use 9.452 mins.
```

All metrics in epoch 99 are about 11/8 of the expected values: 1.333 / (11/8) ≈ 0.969. It looks as if the contributions from 3 GPUs were counted twice in the all-reduce. **The strange results only happen in the final epoch.** What could be the possible reasons?
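To make the arithmetic behind the two logs concrete, here is a toy model of the two failure modes I suspect (pure Python with a hypothetical per-rank metric value; no `torch.distributed` involved):

```python
num_gpus = 8
true_value = 0.969  # hypothetical per-rank metric, identical on every rank

# Expected: all 8 ranks contribute exactly once -> average equals the true value.
expected = sum([true_value] * num_gpus) / num_gpus

# log1 pattern: 4 ranks contribute 0 -> result is 4/8 of the expected value.
log1 = sum([true_value] * 4 + [0.0] * 4) / num_gpus

# log2 pattern: 3 ranks are counted twice -> result is 11/8 of the expected value.
log2 = sum([true_value] * num_gpus + [true_value] * 3) / num_gpus

print(expected, log1, log2)
```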

Thanks!