DDP and performance calculation

Hello everyone. I have what is probably a trivial question, but I could not find a proper answer online, so here it is.

After every 10th epoch in my training loop, I call a function that gauges the model's performance on the validation set (using sklearn). If I use 4 GPUs, this function computes and prints 4 different results.

This works fine with a single GPU, but with DDP each GPU (process) only gets part of the data, so I do not think any single process sees the entire validation set. How can I solve this? Can I calculate the performance on each GPU and combine the results, or maybe evaluate on a single GPU only? My code snippet is below.

for epoch in range(N_epochs):
    model.train()  # switch back to training mode, since model.eval() is set below every 10th epoch
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data

        inputs = inputs.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        y_pred = model(inputs)

        loss = criterion(y_pred, labels)
        running_loss += float(loss.item())

        loss.backward()
        optimizer.step()

    out.write("Epoch: %d Loss %.2f\n" % (epoch, running_loss))
    out.flush()

    if epoch % 10 == 0 and epoch != 0:
        model.eval()
        # Save the model
        checkpoint = {'epoch': epoch,
                      'Epsilon': float(hyper_params["Epsilon"]),
                      'mnt': float(hyper_params["mnt"]),
                      'weight_dcay': float(hyper_params["weight_dcay"]),
                      'state_dict': model.module.state_dict(),
                      'optimizer': optimizer.state_dict(),
                      'Classes': int(hyper_params["Classes"]),
                      'context': int(hyper_params["context"]),
                      'Mseed': int(hyper_params["Mseed"]),
                      'actvtnF': int(hyper_params["actvtnF"]),
                      'CNN_bias': str(hyper_params["CNN_bias"]),
                      'Pool': int(hyper_params["Pool"])}

        checkpoint_name = "Trained_" + str(epoch) + ".pth"
        save_checkpoint(checkpoint, checkpoint_name, multipleGPU_params)

        # Gauge the performance of the model
        f1 = evaluate_model(model, validation_loader, "validation", out, device, hyper_params)
        out.write("Validation F1-score (macro): {}\n".format(f1))

The code snippet for the evaluate_model() function is as follows:

from sklearn.metrics import classification_report, f1_score

def evaluate_model(net, data_loader, test_type, out, device, hyper_params):
    y_true = []
    y_pred = []

    with torch.no_grad():
        for data in data_loader:
            images, labels = data
            images = images.to(device) 

            labels = labels.to('cpu')
            y_true = y_true + torch.flatten(labels).tolist()

            outputs = net(images)
            _, predicted = torch.max(outputs, 1)
            predicted = predicted.to('cpu')
            y_pred = y_pred + torch.flatten(predicted).tolist()
                
    out.write("Performance on " + test_type + " data\n")
    out.write("=========================\n")
    f1 = -1

    out.write(classification_report(y_true, y_pred))
    f1 = f1_score(y_true, y_pred, average='macro')
    out.write("F1 score on " + test_type + str(f1) + "\n")

Right now, the output for the training loss looks like this:

Epoch: 7 Loss 29.49
Epoch: 7 Loss 43.74
Epoch: 7 Loss 41.50
Epoch: 7 Loss 33.20

For every epoch, 4 different losses are printed (it is a 4-GPU system), and the output from evaluate_model() does the same, printing 4 different performance values. I have a feeling that I need to use all_gather or all_reduce, but I am not sure how to use them. Any help would be greatly appreciated.
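For the training loss, I imagine something along these lines (just a rough sketch on my part, assuming the default process group has already been initialized for DDP), but I am not sure whether this is the right approach:

import torch.distributed as dist

# After the inner loop over train_loader: sum the per-process running losses
loss_tensor = torch.tensor([running_loss], device=device)
dist.all_reduce(loss_tensor, op=dist.ReduceOp.SUM)

# Print once, from rank 0 only, instead of once per process
if dist.get_rank() == 0:
    out.write("Epoch: %d Loss %.2f\n" % (epoch, loss_tensor.item()))
    out.flush()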

Thank you.

I would check if you could adapt something like what is done for the metrics in the ImageNet example for your use case.
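For sklearn-style metrics that need the complete label and prediction lists, one option is to gather each rank's lists first and compute the score once. The snippet below is only an untested sketch: it assumes the default process group is initialized (as DDP requires) and that evaluate_model() is modified to return its local y_true and y_pred lists instead of the F1 score.

import torch.distributed as dist
from sklearn.metrics import classification_report, f1_score

# Each process evaluates its own shard of the validation set ...
y_true_local, y_pred_local = evaluate_model(model, validation_loader, "validation",
                                            out, device, hyper_params)

# ... then every rank gathers the per-rank Python lists (all_gather_object handles picklable objects)
world_size = dist.get_world_size()
gathered_true = [None] * world_size
gathered_pred = [None] * world_size
dist.all_gather_object(gathered_true, y_true_local)
dist.all_gather_object(gathered_pred, y_pred_local)

if dist.get_rank() == 0:
    # Flatten the per-rank lists and compute the metrics once, on the full validation set
    y_true = [y for part in gathered_true for y in part]
    y_pred = [y for part in gathered_pred for y in part]
    out.write(classification_report(y_true, y_pred))
    f1 = f1_score(y_true, y_pred, average='macro')
    out.write("Validation F1-score (macro): {}\n".format(f1))

One caveat: DistributedSampler may pad the dataset with duplicated samples so that every process sees the same number of batches, which can skew the metric slightly; if that matters, drop the duplicates or run the whole evaluation on a single rank.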