Distributed.all_reduce returns strange results

Hello, I am using the DDP module to train on ImageNet. To collect training metrics from the different GPUs, I use distributed.all_reduce. Here is the relevant code:

local_rank = args.local_rank
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)

for epoch in range(args.num_epoch + args.warmup_epoch):    
    start = time.time()
    train_loss, train_acc = utils.train_one_epoch(net, train_loader, criterion, optimizer, scheduler)
    val_loss, val_acc = utils.test_one_epoch(net, val_loader, criterion)
    # train_loss, train_acc, val_loss, val_acc are Python floats
    reduce_tensor = torch.tensor([train_loss, train_acc, val_loss, val_acc]).to(device)
    torch.distributed.all_reduce(reduce_tensor)
    reduce_tensor /= args.num_gpus
    # args.num_gpus = 8
    time_used = (time.time() - start) / 60.

    if local_rank == 0:
        print('Epoch %d train loss %.3f acc: %.3f%%; val loss: %.3f acc %.3f%%; use %.3f mins.'%
            (epoch, reduce_tensor[0], reduce_tensor[1], reduce_tensor[2], reduce_tensor[3], time_used))

I only get wrong results in the last epoch. Here are logs from two runs:

log1:

Epoch 97 train loss 0.892 acc: 77.805%; val loss: 0.930 acc 77.010%; use 8.296 mins.
Epoch 98 train loss 0.887 acc: 77.922%; val loss: 0.931 acc 77.024%; use 8.305 mins.
Epoch 99 train loss 0.422 acc: 38.989%; val loss: 0.459 acc 38.506%; use 8.300 mins.

All metrics are 4/8 of the expected values. It seems as if the results from 4 of the 8 GPUs were zero in the reduction.

log2:

Epoch 96 train loss 0.973 acc: 75.933%; val loss: 0.967 acc 76.188%; use 9.449 mins.
Epoch 97 train loss 0.969 acc: 76.003%; val loss: 0.967 acc 76.148%; use 9.459 mins.
Epoch 98 train loss 0.969 acc: 76.029%; val loss: 0.967 acc 76.228%; use 9.445 mins.
Epoch 99 train loss 1.333 acc: 104.523%; val loss: 1.326 acc 104.876%; use 9.452 mins.

All metrics are 11/8 of the expected values: 1.333 / (11/8) = 0.969. It seems as if the results from 3 GPUs were counted twice in the all_reduce. The strange results only happen in the final epoch. What could be the possible reasons?

Thanks!

Your program seems to be correct. Some questions:

  1. What backend are you using?

And could you please run these tests (a sketch of both follows the list)?

  1. Change the device to “cpu”; is the error the same?
  2. Print rank, train_loss, train_acc, val_loss, val_acc in each of your processes, before all_reduce.
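
Something like this (untested sketch, inside the last-epoch branch of your loop; it assumes your existing train_loss/train_acc/val_loss/val_acc and device variables, and uses a separate gloo group for the CPU check, since nccl only accepts CUDA tensors):

import torch
import torch.distributed as dist

rank = dist.get_rank()

# Test 2: print the raw per-rank values before any all_reduce.
print('Rank %d (before all_reduce): train %.3f / %.3f%%; val %.3f / %.3f%%' %
      (rank, train_loss, train_acc, val_loss, val_acc), flush=True)

# Test 1: cross-check the reduction on CPU. A gloo subgroup is used because
# the nccl backend only supports CUDA tensors. Create the group once, outside
# the training loop, since new_group is itself a collective call.
gloo_group = dist.new_group(backend="gloo")
cpu_tensor = torch.tensor([train_loss, train_acc, val_loss, val_acc])
dist.all_reduce(cpu_tensor, group=gloo_group)   # SUM over all ranks on CPU
print('Rank %d (gloo/cpu sum): %s' % (rank, cpu_tensor.tolist()), flush=True)

# Original reduction on GPU with the default nccl group, for comparison.
cuda_tensor = torch.tensor([train_loss, train_acc, val_loss, val_acc]).to(device)
dist.all_reduce(cuda_tensor)                    # SUM over all ranks on GPU
print('Rank %d (nccl/cuda sum): %s' % (rank, cuda_tensor.tolist()), flush=True)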

Theoretically this problem should not happen (see the minimal example after this list):

  1. The default gloo backend supports all_reduce and broadcast on GPU tensors.
  2. You cannot double-count or leave out a process, since all_reduce is a blocking collective.
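
For reference, here is a minimal self-contained example of the expected behaviour on the gloo backend with CPU tensors (the address, port, and metric values are made up for illustration): every rank contributes exactly once, the call blocks until all ranks arrive, and SUM divided by the world size gives the average.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Pretend per-rank metrics: loss = 1.0 + rank, acc = 70.0 + rank
    metrics = torch.tensor([1.0 + rank, 70.0 + rank])
    dist.all_reduce(metrics)   # defaults to SUM; blocks until every rank has called it
    metrics /= world_size      # average over ranks
    if rank == 0:
        print("averaged metrics:", metrics.tolist())

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)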

Well, if you use NCCL, then it must be a CUDA tensor. Please go on :slightly_smiling_face:

Thanks for responding! I use NCCL as the backend:

torch.distributed.init_process_group(backend="nccl")

If I change the device to ‘cpu’, I get an error: Tensors must be CUDA and dense.
I will try to print them before all_reduce and see what happens.

Hey @KaiHoo, can you print the reduce_tensor before you pass it to all_reduce, so that we can narrow down whether it is the all_reduce or the DDP training/testing that is misbehaving?

@iffiX @mrshenli Hello, sorry for the late reply. Here are the logs from the last few epochs, followed by the updated code:

Epoch 95 train loss 1.056 acc: 74.176%; val loss: 0.954 acc 75.958%; use 12.457 mins.
Epoch 96 train loss 1.048 acc: 74.339%; val loss: 0.949 acc 75.998%; use 12.459 mins.
Epoch 97 train loss 1.028 acc: 74.815%; val loss: 0.946 acc 76.232%; use 12.455 mins.
Epoch 98 train loss 1.027 acc: 74.855%; val loss: 0.946 acc 76.236%; use 12.475 mins.
Rank 5 train loss 1.026 acc: 74.890%; val loss: 0.972 acc 75.600%
Rank 6 train loss 1.025 acc: 74.815%; val loss: 0.924 acc 76.352%
Rank 3 train loss 1.025 acc: 74.889%; val loss: 0.957 acc 75.632%
Rank 1 train loss 1.032 acc: 74.757%; val loss: 0.929 acc 76.960%
Rank 7 train loss 1.023 acc: 75.038%; val loss: 0.930 acc 76.512%
Rank 2 train loss 1.019 acc: 75.013%; val loss: 0.958 acc 76.144%
Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x7ffb5070eea0>
Traceback (most recent call last):
  File "/usr/lib/python3.5/weakref.py", line 117, in remove
TypeError: 'NoneType' object is not callable
Rank 0 train loss 1.022 acc: 74.841%; val loss: 0.951 acc 76.304%
Epoch 99 train loss 0.513 acc: 37.451%; val loss: 0.470 acc 38.097%; use 12.467 mins.
Rank 4 train loss 1.023 acc: 74.984%; val loss: 0.947 acc 76.304%

Since the strange results only happen in the final epoch, I only print the per-rank metrics for the last epoch. The order of the log lines above is exactly what I got, although the ‘Rank 4’ line should have been printed before the ‘Epoch 99 train’ line:

for epoch in range(args.num_epoch + args.warmup_epoch):    
    start = time.time()
    train_loss, train_acc = utils.train_one_epoch(net, train_loader, 
        criterion, optimizer, mean_and_std, scheduler, args)
    val_loss, val_acc = utils.test_one_epoch(net, val_loader, criterion, mean_and_std)
    reduce_tensor = torch.tensor([train_loss, train_acc, val_loss, val_acc]).to(device)
    if epoch == args.num_epoch + args.warmup_epoch - 1:
        print('Rank %d train loss %.3f acc: %.3f%%; val loss: %.3f acc %.3f%%'%
            (local_rank, reduce_tensor[0], reduce_tensor[1], reduce_tensor[2], reduce_tensor[3]))
    torch.distributed.all_reduce(reduce_tensor)
    reduce_tensor /= args.num_gpus
    time_used = (time.time() - start) / 60.

    if local_rank == 0:
        print('Epoch %d train loss %.3f acc: %.3f%%; val loss: %.3f acc %.3f%%; use %.3f mins.'%
            (epoch, reduce_tensor[0], reduce_tensor[1], reduce_tensor[2], reduce_tensor[3], time_used))
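
By the way, the interleaving itself is probably just unsynchronized stdout: each process writes independently, so the rank 0 summary can be flushed before another rank's diagnostic line. A barrier between the per-rank print and the reduction should force the ordering; a rough sketch of the change to the loop above (same variables as in the code, untested):

    if epoch == args.num_epoch + args.warmup_epoch - 1:
        print('Rank %d train loss %.3f acc: %.3f%%; val loss: %.3f acc %.3f%%'%
            (local_rank, reduce_tensor[0], reduce_tensor[1], reduce_tensor[2], reduce_tensor[3]),
            flush=True)
        torch.distributed.barrier()   # wait until every rank has printed its line
    torch.distributed.all_reduce(reduce_tensor)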

Is the above error expected? How did you handle it? If it is handled by skipping or redoing that iteration, it could cause an all_reduce mismatch.
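
If that turns out to be the cause, one defensive pattern is to make sure every rank still joins the collective even when its local iteration fails, so the call counts stay matched across ranks. A rough sketch (the helper name and the validity-flag scheme are just for illustration, not something your code has to use):

import torch
import torch.distributed as dist

def reduce_metrics_safely(metrics, device):
    """All ranks must reach this call the same number of times per epoch.

    If the local epoch failed, pass metrics=None; the rank still joins the
    all_reduce, contributing zeros plus a validity flag of 0, so later
    collectives on the other ranks cannot get paired with the wrong call.
    """
    ok = 0.0 if metrics is None else 1.0
    values = [0.0, 0.0, 0.0, 0.0] if metrics is None else list(metrics)
    t = torch.tensor(values + [ok], device=device)
    dist.all_reduce(t)                           # SUM over all ranks
    valid_ranks = max(t[-1].item(), 1.0)         # avoid division by zero
    return (t[:-1] / valid_ranks).tolist()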

I have no idea about this error; nothing else seemed to go wrong because of it. The bug has been reported to PyTorch, but it seems to be a bug in Python itself.

I’m a bit curious about this bug. Did anyone get to the bottom of it? @KaiHoo, just in case, are you spreading the GPUs across different nodes?

It is single-node DDP.