Hello,
I am using DDP to train my model on 2 GPUs. At inference time, here's what I do:
self.model.eval()
for i, (_, visual, labels) in enumerate(self.test_data):
    visual = visual.to(self.gpu_id)
    labels = labels.to(self.gpu_id)
    loss = self._run_batch_val(visual, labels)
And here's what happens in the _run_batch_val function:
def _run_batch_val(self, visual, labels):
    with torch.no_grad():
        predictions, loss = self.model(visual, labels)  # (batch_size, num_pathologies)
    self.pred_test = np.vstack((self.pred_test, predictions.detach().cpu().tolist()))
    self.labels_test = np.vstack((self.labels_test, labels.detach().cpu().tolist()))
    return {'loss_val': loss.mean().item()}
Here self.pred_test and self.labels_test are NumPy arrays of shape (n, 5), since it's a multilabel classification problem (n samples, 5 labels).
So when computing metrics, I get two results: one per GPU. Do you know how to merge the predictions and labels from the two GPUs so that I can compute a single overall metric?
I heard about dist.all_gather but could not work out how to apply it to my problem.
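For reference, one common pattern is indeed torch.distributed.all_gather: each rank converts its local array to a tensor, every rank receives every other rank's copy, and the slices are stacked back together. The sketch below is a minimal, hedged example (gather_numpy is a hypothetical helper name, and it assumes every rank ends up with the same number of test samples; with a DistributedSampler that is not guaranteed, in which case dist.all_gather_object on the raw arrays is a simpler, if slower, alternative):

```python
import numpy as np
import torch
import torch.distributed as dist

def gather_numpy(local_array: np.ndarray, device) -> np.ndarray:
    """Collect a (n_local, 5) array from every rank and stack them.

    Assumes each rank contributes the same n_local rows. `device`
    would be self.gpu_id in the code above (NCCL requires the
    tensors to live on the GPU; gloo works with CPU tensors).
    """
    local = torch.from_numpy(np.asarray(local_array)).to(device)
    # One receive buffer per rank, same shape/dtype as the local slice.
    buffers = [torch.empty_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(buffers, local)  # after this, every rank holds all slices
    return torch.cat(buffers, dim=0).cpu().numpy()
```

After the test loop you could then call something like all_preds = gather_numpy(self.pred_test, self.gpu_id) and all_labels = gather_numpy(self.labels_test, self.gpu_id), and compute the metric once (typically only on rank 0, guarded by dist.get_rank() == 0, so it isn't printed twice).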
Hope it’s clear, thank you so much in advance!