How to evaluate a model in distributed training

Hello everyone.

I have developed a script for DDP training following the PyTorch tutorials and vision/references/classification/train.py, so my code is pretty similar to the original from pytorch/vision.

Training and evaluation work fine. However, I am concerned with the evaluation metric: I get two different metrics because I have two GPUs (on a single node). This can be seen in the following image.

My dataset has 5000 images, and each GPU evaluates half of the data. However, what I want is to evaluate all 5000 images with one single model trained in a distributed manner. I would appreciate some guidelines on this. Here is a simplified version of my evaluation code:

import torch
from sklearn.metrics import roc_auc_score

model.eval()

y_true = list()
y_score = list()

with torch.no_grad():
    for x, y in dataloader_test:
        x = x.to(args.device)
        output = model(x)
        y_true.extend(y.tolist())               # Binary labels
        y_score.extend(output.cpu().tolist())   # Predicted scores

image_auroc = roc_auc_score(y_true, y_score)

Thanks

Update:

Let us consider just the y_true variable. I want one single list containing the binary labels from both GPUs. I added this code just before the last line:

y_true = torch.tensor(y_true, device="cuda")
torch.distributed.barrier()
torch.distributed.all_reduce(y_true)  # in-place: sums the tensors element-wise across ranks

The binary values get added together because torch.distributed.all_reduce performs a sum. What I want instead is to concatenate, so that y_true becomes [y_true_from_gpu0, y_true_from_gpu1]. If that works, I would do the same with y_score and compute one single AUROC metric for the whole evaluation.

This example clarifies what I mean:

import torch
import os

rank = int(os.environ["RANK"])     # Rank of the current process in the job
world_size = int(os.environ["WORLD_SIZE"])     # Number of processes participating in the job
gpu = int(os.environ["LOCAL_RANK"])    # Rank of the current process in the local job
dist_url = 'env://'
torch.cuda.set_device(gpu)
torch.distributed.init_process_group(
    backend='nccl', init_method=dist_url, world_size=world_size, rank=rank
)
torch.distributed.barrier()

tensor = torch.arange(5, device=gpu)
torch.distributed.barrier()
print(tensor) # tensor([0, 1, 2, 3, 4], device='cuda:0')  tensor([0, 1, 2, 3, 4], device='cuda:1')
torch.distributed.all_reduce(tensor)
print(tensor) # tensor([0, 2, 4, 6, 8], device='cuda:0')  tensor([0, 2, 4, 6, 8], device='cuda:1')
# However, I want to have [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
# Supposing that list were the final result, I would run the evaluation with the model held by RANK == 0
# Is this the correct way to evaluate the performance in distributed training?
# Thanks

For now, I have decided to use this as a solution:

image_auroc = roc_auc_score(y_true, y_score)
image_auroc = torch.tensor(image_auroc, device="cuda")
torch.distributed.barrier()
torch.distributed.all_reduce(image_auroc, op=torch.distributed.ReduceOp.AVG)

Nevertheless, I would appreciate feedback on whether this is the correct way to solve it and whether the performance would be similar to single-GPU training. Thanks

What I want instead is to concatenate, so that y_true becomes [y_true_from_gpu0, y_true_from_gpu1]. If that works, I would do the same with y_score and compute one single AUROC metric for the whole evaluation.

Sounds like all_gather would be a better fit for that: Distributed communication package - torch.distributed — PyTorch master documentation
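For illustration, here is a minimal sketch of that approach (not tested against your full script). It assumes the process group from your snippet is already initialized and that y_true / y_score are the per-rank Python lists built in the evaluation loop; all_gather_object is used because it also works when the ranks hold lists of different lengths.

import torch.distributed as dist
from sklearn.metrics import roc_auc_score

world_size = dist.get_world_size()

# Gather the per-rank label and score lists onto every rank
gathered_true = [None] * world_size
gathered_score = [None] * world_size
dist.all_gather_object(gathered_true, y_true)
dist.all_gather_object(gathered_score, y_score)

# Every rank now holds [y_true_from_gpu0, y_true_from_gpu1]; flatten the lists
# and compute the metric once, e.g. only on rank 0
if dist.get_rank() == 0:
    all_true = [label for part in gathered_true for label in part]
    all_score = [score for part in gathered_score for score in part]
    image_auroc = roc_auc_score(all_true, all_score)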

torch.distributed.barrier()
torch.distributed.all_reduce(image_auroc, op=torch.distributed.ReduceOp.AVG)

You don’t need to call barrier() before all_reduce(), as all_reduce() is already a synchronization point.
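As a sketch, your averaging approach could then be reduced to the following (assuming the same y_true / y_score lists and the NCCL backend, which is what supports ReduceOp.AVG):

import torch
import torch.distributed as dist
from sklearn.metrics import roc_auc_score

# Per-rank AUROC, averaged across ranks; all_reduce itself acts as the synchronization point
image_auroc = torch.tensor(roc_auc_score(y_true, y_score), device="cuda")
dist.all_reduce(image_auroc, op=dist.ReduceOp.AVG)

Note that an average of per-rank AUROCs is in general not identical to the AUROC computed over all 5000 predictions at once, which is why the all_gather approach above is a closer match to single-GPU evaluation.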

whether the performance would be similar to single-GPU training.

Since you are using 2 GPUs to perform evaluation and splitting the data between them, it should theoretically be about twice as fast, minus the communication overhead.

Thanks for your help.

What do you think would be the better idea: averaging the two final metrics, or gathering everything into one list and evaluating the single model on it?
