Hello,
I am using DDP to train my model on 2 GPUs. At inference time, here's what I do:
self.model.eval()
for i, (_, visual, labels) in enumerate(self.test_data):
    visual = visual.to(self.gpu_id)
    labels = labels.to(self.gpu_id)
    loss = self._run_batch_val(visual, labels)
And here's what happens in the _run_batch_val function:
def _run_batch_val(self, visual, labels):
    with torch.no_grad():
        predictions, loss = self.model(visual, labels)  # (batch_size, num_pathologies)
    self.pred_test = np.vstack((self.pred_test, predictions.detach().cpu().tolist()))
    self.labels_test = np.vstack((self.labels_test, labels.detach().cpu().tolist()))
    return {'loss_val': loss.mean().item()}
Here self.pred_test and self.labels_test are NumPy arrays of shape (n, 5), since it's a multilabel classification problem (n samples, 5 labels).
So when computing metrics, I get two results: one per GPU. Do you know how to merge the predictions and labels from the two GPUs so that I can compute a single overall metric?
I heard about dist.all_gather but could not work out how to apply it to my problem.
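For reference, one common pattern is indeed torch.distributed.all_gather: each rank converts its local array to a tensor, every rank receives every other rank's copy, and the slices are stacked back together. The sketch below is a minimal, hedged example (gather_numpy is a hypothetical helper name, and it assumes every rank ends up with the same number of test samples; with a DistributedSampler that is not guaranteed, in which case dist.all_gather_object on the raw arrays is a simpler, if slower, alternative):

```python
import numpy as np
import torch
import torch.distributed as dist

def gather_numpy(local_array: np.ndarray, device) -> np.ndarray:
    """Collect a (n_local, 5) array from every rank and stack them.

    Assumes each rank contributes the same n_local rows. `device`
    would be self.gpu_id in the code above (NCCL requires the
    tensors to live on the GPU; gloo works with CPU tensors).
    """
    local = torch.from_numpy(np.asarray(local_array)).to(device)
    # One receive buffer per rank, same shape/dtype as the local slice.
    buffers = [torch.empty_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(buffers, local)  # after this, every rank holds all slices
    return torch.cat(buffers, dim=0).cpu().numpy()
```

After the test loop you could then call something like all_preds = gather_numpy(self.pred_test, self.gpu_id) and all_labels = gather_numpy(self.labels_test, self.gpu_id), and compute the metric once (typically only on rank 0, guarded by dist.get_rank() == 0, so it isn't printed twice).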
Hope it’s clear, thank you so much in advance!