How to Gather Prediction Results from Different Sub-Processes?

I am currently working with a Distributed Data Parallel (DDP) setup for deep learning models. You can find the details of my setup in my earlier post, "Can't distribute data to all GPUs with DDP". In my setup, I use a Trainer object to make predictions for batches and then merge all the predictions into one convenient dictionary object. Each prediction is associated with a unique ID, which is why I chose to use a dictionary.

Once I have created the dictionary containing all the predictions for the dataset, I proceed to calculate performance metrics. However, I encounter an issue with the DDP setup: the predictions are split into two distinct dictionaries, each residing on a different GPU. This poses a challenge, as I need to gather all the predictions into a single dictionary to make accurate performance metric calculations.

I am seeking guidance on how to efficiently merge these two separate dictionary objects that reside on different GPUs into one dictionary. The merged dictionary can reside on the CPU. Any insights or solutions to address this problem would be greatly appreciated!

You can exchange them across all ranks using all_gather_object:
https://pytorch.org/docs/master/distributed.html#torch.distributed.all_gather_object
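A minimal sketch of how this could look, assuming each rank holds a `{id: prediction}` dict with IDs unique across ranks (the helper names `merge_gathered` and `gather_predictions` are my own, not part of the PyTorch API). Note that `all_gather_object` pickles its inputs, so any tensors in the dict should be moved to the CPU first:

```python
from typing import Any, Dict, List


def merge_gathered(dicts: List[Dict[Any, Any]]) -> Dict[Any, Any]:
    """Merge the per-rank prediction dicts into a single dict.

    Assumes prediction IDs are unique across ranks, as described above,
    so no key collisions occur.
    """
    merged: Dict[Any, Any] = {}
    for d in dicts:
        merged.update(d)
    return merged


def gather_predictions(local_preds: Dict[Any, Any]) -> Dict[Any, Any]:
    """Collect every rank's {id: prediction} dict and merge them.

    Must be called collectively by every rank after
    dist.init_process_group(...). Each rank receives the full result.
    """
    import torch.distributed as dist  # imported here so merge_gathered stays torch-free

    # One slot per rank; all_gather_object fills each slot with that
    # rank's local dict on every process.
    gathered: List[Dict[Any, Any]] = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_preds)
    return merge_gathered(gathered)
```

After the call, every rank holds the complete dictionary, so you can compute metrics on rank 0 (or on all ranks) without further communication.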
