Distributed.all_reduce returns strange results

Hello, I am using the DDP module to train on ImageNet. To collect training metrics from the different GPUs, I use distributed.all_reduce. Here is the relevant code:

local_rank = args.local_rank
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)

for epoch in range(args.num_epoch + args.warmup_epoch):    
    start = time.time()
    train_loss, train_acc = utils.train_one_epoch(net, train_loader, criterion, optimizer, scheduler)
    val_loss, val_acc = utils.test_one_epoch(net, val_loader, criterion)
    # train_loss, train_acc, val_loss, val_acc are Python floats
    reduce_tensor = torch.tensor([train_loss, train_acc, val_loss, val_acc]).to(device)
    torch.distributed.all_reduce(reduce_tensor)
    reduce_tensor /= args.num_gpus
    # args.num_gpus = 8
    time_used = (time.time() - start) / 60.

    if local_rank == 0:
        print('Epoch %d train loss %.3f acc: %.3f%%; val loss: %.3f acc %.3f%%; use %.3f mins.'%
            (epoch, reduce_tensor[0], reduce_tensor[1], reduce_tensor[2], reduce_tensor[3], time_used))

I only get wrong results in the last epoch. Here are logs from two runs:

log1:

Epoch 97 train loss 0.892 acc: 77.805%; val loss: 0.930 acc 77.010%; use 8.296 mins.
Epoch 98 train loss 0.887 acc: 77.922%; val loss: 0.931 acc 77.024%; use 8.305 mins.
Epoch 99 train loss 0.422 acc: 38.989%; val loss: 0.459 acc 38.506%; use 8.300 mins.

All metrics are 4/8 of the expected values. It seems as if the results from 4 of the 8 GPUs were zero in the reduction.

log2:

Epoch 96 train loss 0.973 acc: 75.933%; val loss: 0.967 acc 76.188%; use 9.449 mins.
Epoch 97 train loss 0.969 acc: 76.003%; val loss: 0.967 acc 76.148%; use 9.459 mins.
Epoch 98 train loss 0.969 acc: 76.029%; val loss: 0.967 acc 76.228%; use 9.445 mins.
Epoch 99 train loss 1.333 acc: 104.523%; val loss: 1.326 acc 104.876%; use 9.452 mins.

All metrics are 11/8 of the expected values: 1.333 / (11/8) = 0.969. It seems as if the results from 3 GPUs were counted twice in the all_reduce. The strange results only happen in the final epoch. What could be the possible reasons?

Thanks!

Your program seems to be correct. Some questions:

  1. What backend are you using?

And could you please run these tests (a sketch of both follows the list)?

  1. Change the device to “cpu”; is the error the same?
  2. Print rank, train_loss, train_acc, val_loss, val_acc in each of your processes, before all_reduce.
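
Something like this (untested sketch, inside the last-epoch branch of your loop; it assumes your existing train_loss/train_acc/val_loss/val_acc and device variables, and uses a separate gloo group for the CPU check, since nccl only accepts CUDA tensors):

import torch
import torch.distributed as dist

rank = dist.get_rank()

# Test 2: print the raw per-rank values before any all_reduce.
print('Rank %d (before all_reduce): train %.3f / %.3f%%; val %.3f / %.3f%%' %
      (rank, train_loss, train_acc, val_loss, val_acc), flush=True)

# Test 1: cross-check the reduction on CPU. A gloo subgroup is used because
# the nccl backend only supports CUDA tensors. Create the group once, outside
# the training loop, since new_group is itself a collective call.
gloo_group = dist.new_group(backend="gloo")
cpu_tensor = torch.tensor([train_loss, train_acc, val_loss, val_acc])
dist.all_reduce(cpu_tensor, group=gloo_group)   # SUM over all ranks on CPU
print('Rank %d (gloo/cpu sum): %s' % (rank, cpu_tensor.tolist()), flush=True)

# Original reduction on GPU with the default nccl group, for comparison.
cuda_tensor = torch.tensor([train_loss, train_acc, val_loss, val_acc]).to(device)
dist.all_reduce(cuda_tensor)                    # SUM over all ranks on GPU
print('Rank %d (nccl/cuda sum): %s' % (rank, cuda_tensor.tolist()), flush=True)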

Theoretically this problem should not happen (see the minimal example after this list):

  1. The default gloo backend supports all_reduce and broadcast on GPU tensors.
  2. You cannot double-count or leave out a process, since all_reduce is a blocking collective.
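
For reference, here is a minimal self-contained example of the expected behaviour on the gloo backend with CPU tensors (the address, port, and metric values are made up for illustration): every rank contributes exactly once, the call blocks until all ranks arrive, and SUM divided by the world size gives the average.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Pretend per-rank metrics: loss = 1.0 + rank, acc = 70.0 + rank
    metrics = torch.tensor([1.0 + rank, 70.0 + rank])
    dist.all_reduce(metrics)   # defaults to SUM; blocks until every rank has called it
    metrics /= world_size      # average over ranks
    if rank == 0:
        print("averaged metrics:", metrics.tolist())

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)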

Well, if you use NCCL, then it must be a CUDA tensor. Please go on :slightly_smiling_face:

Thanks for responding! I use NCCL as the backend:

torch.distributed.init_process_group(backend="nccl")

If I change the device to ‘cpu’, I get an error: Tensors must be CUDA and dense.
I will try to print them before all_reduce and see what happens.

Hey @KaiHoo, can you print the reduce_tensor before you pass it to all_reduce, so that we can narrow down whether it is the all_reduce or the DDP training/testing that is misbehaving?

@iffiX @mrshenli Hello, sorry for the late reply. Here are the logs from the last few epochs, followed by the updated code:

Epoch 95 train loss 1.056 acc: 74.176%; val loss: 0.954 acc 75.958%; use 12.457 mins.
Epoch 96 train loss 1.048 acc: 74.339%; val loss: 0.949 acc 75.998%; use 12.459 mins.
Epoch 97 train loss 1.028 acc: 74.815%; val loss: 0.946 acc 76.232%; use 12.455 mins.
Epoch 98 train loss 1.027 acc: 74.855%; val loss: 0.946 acc 76.236%; use 12.475 mins.
Rank 5 train loss 1.026 acc: 74.890%; val loss: 0.972 acc 75.600%
Rank 6 train loss 1.025 acc: 74.815%; val loss: 0.924 acc 76.352%
Rank 3 train loss 1.025 acc: 74.889%; val loss: 0.957 acc 75.632%
Rank 1 train loss 1.032 acc: 74.757%; val loss: 0.929 acc 76.960%
Rank 7 train loss 1.023 acc: 75.038%; val loss: 0.930 acc 76.512%
Rank 2 train loss 1.019 acc: 75.013%; val loss: 0.958 acc 76.144%
Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x7ffb5070eea0>
Traceback (most recent call last):
  File "/usr/lib/python3.5/weakref.py", line 117, in remove
TypeError: 'NoneType' object is not callable
Rank 0 train loss 1.022 acc: 74.841%; val loss: 0.951 acc 76.304%
Epoch 99 train loss 0.513 acc: 37.451%; val loss: 0.470 acc 38.097%; use 12.467 mins.
Rank 4 train loss 1.023 acc: 74.984%; val loss: 0.947 acc 76.304%

Since the strange results only happen in the final epoch, I only print the per-rank metrics for the last epoch. The order of the log lines above is exactly what I got, although the ‘Rank 4’ line should have been printed before the ‘Epoch 99 train’ line:

for epoch in range(args.num_epoch + args.warmup_epoch):    
    start = time.time()
    train_loss, train_acc = utils.train_one_epoch(net, train_loader, 
        criterion, optimizer, mean_and_std, scheduler, args)
    val_loss, val_acc = utils.test_one_epoch(net, val_loader, criterion, mean_and_std)
    reduce_tensor = torch.tensor([train_loss, train_acc, val_loss, val_acc]).to(device)
    if epoch == args.num_epoch + args.warmup_epoch - 1:
        print('Rank %d train loss %.3f acc: %.3f%%; val loss: %.3f acc %.3f%%'%
            (local_rank, reduce_tensor[0], reduce_tensor[1], reduce_tensor[2], reduce_tensor[3]))
    torch.distributed.all_reduce(reduce_tensor)
    reduce_tensor /= args.num_gpus
    time_used = (time.time() - start) / 60.

    if local_rank == 0:
        print('Epoch %d train loss %.3f acc: %.3f%%; val loss: %.3f acc %.3f%%; use %.3f mins.'%
            (epoch, reduce_tensor[0], reduce_tensor[1], reduce_tensor[2], reduce_tensor[3], time_used))
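
By the way, the interleaving itself is probably just unsynchronized stdout: each process writes independently, so the rank 0 summary can be flushed before another rank's diagnostic line. A barrier between the per-rank print and the reduction should force the ordering; a rough sketch of the change to the loop above (same variables as in the code, untested):

    if epoch == args.num_epoch + args.warmup_epoch - 1:
        print('Rank %d train loss %.3f acc: %.3f%%; val loss: %.3f acc %.3f%%'%
            (local_rank, reduce_tensor[0], reduce_tensor[1], reduce_tensor[2], reduce_tensor[3]),
            flush=True)
        torch.distributed.barrier()   # wait until every rank has printed its line
    torch.distributed.all_reduce(reduce_tensor)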

Is the above error expected? How did you handle it? If it is handled by skipping or redoing that iteration, it could cause an all_reduce mismatch.
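
If that turns out to be the cause, one defensive pattern is to make sure every rank still joins the collective even when its local iteration fails, so the call counts stay matched across ranks. A rough sketch (the helper name and the validity-flag scheme are just for illustration, not something your code has to use):

import torch
import torch.distributed as dist

def reduce_metrics_safely(metrics, device):
    """All ranks must reach this call the same number of times per epoch.

    If the local epoch failed, pass metrics=None; the rank still joins the
    all_reduce, contributing zeros plus a validity flag of 0, so later
    collectives on the other ranks cannot get paired with the wrong call.
    """
    ok = 0.0 if metrics is None else 1.0
    values = [0.0, 0.0, 0.0, 0.0] if metrics is None else list(metrics)
    t = torch.tensor(values + [ok], device=device)
    dist.all_reduce(t)                           # SUM over all ranks
    valid_ranks = max(t[-1].item(), 1.0)         # avoid division by zero
    return (t[:-1] / valid_ranks).tolist()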

I have no idea about this error; nothing else seemed to go wrong because of it. The bug has been reported to PyTorch, but it seems to be a bug in Python itself.

I’m a bit curious about this bug. Did anyone get to the bottom of it? @KaiHoo, just in case, are you spreading the GPUs across different nodes?

It is single-node DDP.