(Distributed) Why are all GPUs giving the same output?

Hi, I am a newcomer to PyTorch and I’m confused by something while running the official torch.distributed example
at PyTorch ImageNet main.py L304.

I have made a small modification to the evaluation part of the source code, as shown below:

model.eval()
with torch.no_grad():
    end = time.time()
    for i, (images, target, image_ids) in enumerate(val_loader):
        if args.gpu is not None:
            images = images.cuda(args.gpu, non_blocking=True)

        target = target.cuda(args.gpu, non_blocking=True)
        image_ids = image_ids.data.cpu().numpy()
        output = model(images)
        loss = criterion(output, target)

        # Get acc1, acc5 and update
        acc1, acc5 = accuracy(output, target, topk=(1, 5))
        losses.update(loss.item(), images.size(0))
        top1.update(acc1[0], images.size(0))
        top5.update(acc5[0], images.size(0))

        # print at the first batch of images only
        dist.barrier()
        if i == 0:
            print("gpu", args.gpu, acc1, output.shape)

And the above code gives the following output:

Use GPU: 0 for training
Use GPU: 1 for training
Use GPU: 3 for training
Use GPU: 2 for training
=> loading checkpoint 'model_best.pth.tar'
...
gpu 3 tensor([75.], device='cuda:3') torch.Size([32, 200])
gpu 2 tensor([75.], device='cuda:2') torch.Size([32, 200])
gpu 1 tensor([75.], device='cuda:1') torch.Size([32, 200])
gpu 0 tensor([75.], device='cuda:0') torch.Size([32, 200])

As I am using 4 GPUs with a batch size of 128, I think the 128 images have been divided and fed to the 4 GPUs respectively, so each of the four GPUs has output.shape[0] = 32 (where 200 is num_classes).

But what has really confused me is that all 4 GPUs show the same acc1. In my understanding, since the 4 GPUs take different portions of the input (32 images each), they should also give different outputs and accuracies for their respective inputs. However, in my print test, the GPUs show the same output and accuracy, and I don’t know why. Shouldn’t they be different?
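One quick sanity check I thought of (just a sketch, assuming the dataset returns image_ids as in my modified loop above) is to print the sample ids each rank actually receives in the first batch; if every rank prints the same ids, then all processes are being fed the same images:

        # Debug sketch: goes inside the eval loop above, right after image_ids is
        # moved to numpy. If every rank prints identical ids, the ranks are not
        # getting disjoint shards of the validation data.
        if i == 0:
            print("rank", args.gpu, "sees ids", image_ids[:8])
        dist.barrier()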

Looking for help. Thank you in advance!

Okay, I think I may have found the answer in the repo’s GitHub issues: “distributed eval to be done”.

That’s correct. Without a distributed sampler for the evaluation dataset, the different processes end up processing the same evaluation inputs, and correspondingly give the same accuracy.
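For anyone hitting the same thing, here is a minimal sketch of the fix, assuming val_dataset, args.batch_size and args.workers from the ImageNet example: wrap the evaluation dataset in a DistributedSampler so each process evaluates its own shard.

import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Sketch: give every process a disjoint 1/world_size slice of the eval data.
# val_dataset, args.batch_size and args.workers are assumed from the ImageNet example.
val_sampler = DistributedSampler(val_dataset, shuffle=False)

val_loader = DataLoader(
    val_dataset,
    batch_size=args.batch_size,   # per-process batch size
    shuffle=False,                # the sampler handles the split
    num_workers=args.workers,
    pin_memory=True,
    sampler=val_sampler)

With the sampler in place, each rank sees different images and the per-rank acc1 values will differ. Note that DistributedSampler may pad the dataset so every rank gets the same number of samples, so if you want an exact global accuracy you still need to combine the per-rank statistics (e.g. with dist.all_reduce) rather than read them off a single process.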