DistributedDataParallel on evaluation phase

Hi,
I followed the example from the following link: "💥 Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups | by Thomas Wolf | HuggingFace | Medium".
I would like to split the test dataset across multiple GPUs since it is very large. However, the problem is that I cannot gather the results from all the GPUs after the test finishes on each GPU.
Is there any code that I can use?
Thanks

Hi @farhadzadeh,
Could you give me a code snippet of what you are doing at the moment?
That way it will be easier to help :slight_smile:

The evaluation code, which runs on each GPU:

.....
        with torch.no_grad():
            val_outputs = []
            val_labels = []
            for batch_idx, (images_id, images, _, _, labels) in tqdm(enumerate(self.val_dataloader_2), desc="validation"):
                labels = labels.to(self.device)
                images = images.to(self.device)

                # step 0: generate text from the images
                output_0 = self.model_image2text(images, mode='sample')
                # decode the generated ids; required for the next step
                output_from_image2text = self.image2text_tokenizer.decode_batch(output_0.cpu().numpy())
                input_encodings = self.tokenizer.batch_encode_plus(output_from_image2text, return_tensors="pt",
                                                                   pad_to_max_length=True,
                                                                   max_length=self.args.max_seq_length,
                                                                   truncation=True)
                input_ids = input_encodings['input_ids']
                attention_mask = input_encodings['attention_mask']

                # run the main model on the re-encoded text
                outputs = self.model(input_ids.to(self.device), attention_mask.to(self.device), mode='sample')
                val_outputs.extend(outputs)
                val_labels.extend(labels)

            # keep the per-rank results so they can be gathered later
            self.val_outputs.append(torch.stack(val_outputs).to(self.device))
            self.val_labels.append(torch.stack(val_labels).to(self.device))

After that, I would like to gather all the val_outputs:

.....
        torch.distributed.barrier()
        if torch.distributed.get_rank() in [-1, 0]:
            print(f"all: {len(self.val_outputs)}")
            torch.distributed.all_reduce_multigpu(self.val_outputs)

            print(f"all: {len(self.val_outputs)}")
            torch.distributed.all_reduce_multigpu(self.val_labels)

Could you please try to use torch.distributed.all_reduce(self.val_outputs)?

Thanks for responding. However, I found out that all_reduce is for reduce ops such as MAX, SUM, and so on.
The one that I was looking for is torch.distributed.all_gather(tensor_list, tensor).
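
For reference, a minimal sketch of how torch.distributed.all_gather can be used to collect per-rank evaluation tensors. The helper name is illustrative, and it assumes the process group is already initialized and that every rank contributes a tensor of the same shape:

    import torch
    import torch.distributed as dist

    def gather_eval_tensors(local_tensor):
        # One receive buffer per rank; all_gather requires identically
        # shaped tensors on every rank.
        world_size = dist.get_world_size()
        gathered = [torch.zeros_like(local_tensor) for _ in range(world_size)]
        dist.all_gather(gathered, local_tensor)
        # Concatenate along the batch dimension to recover the full set.
        return torch.cat(gathered, dim=0)

In the evaluation loop above, torch.stack(val_outputs) and torch.stack(val_labels) could each be passed in as local_tensor. If the validation shards have uneven sizes, the shapes will not match across ranks, so you would need to pad them first or (in newer PyTorch versions) use torch.distributed.all_gather_object.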

My training flow is running fine, but my evaluation part is running 4 times instead of just once (I am using 4 GPUs). I am using a distributed sampler for both the train and val datasets.

Any idea what could be wrong, or how I can debug this?
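
For context, with DistributedDataParallel the script is launched once per GPU, so an evaluation loop that is not guarded by a rank check runs in every process, each rank covering only its own DistributedSampler shard. A minimal sketch of reporting from a single rank (the helper name is hypothetical, assuming the process group is already initialized):

    import torch.distributed as dist

    def evaluate_and_report(evaluate_fn):
        # Every rank evaluates its own shard of the validation set.
        local_metrics = evaluate_fn()

        # Wait for all ranks to finish, then report from rank 0 only
        # so the output does not appear once per GPU.
        dist.barrier()
        if dist.get_rank() == 0:
            print(f"validation metrics: {local_metrics}")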