I am stuck on how to manage the evaluation phase with a DistributedDataParallel model. In my evaluation loop I accumulate the correct predictions in order to compute the final accuracy per epoch. These predictions are stored in a list of dictionaries. My model is wrapped in DistributedDataParallel, so each process computes predictions on a separate portion of the dataset.
Unfortunately, the predictions are not tensors, so I cannot use the collective utilities provided in
torch.distributed. I tried saving all the lists to disk and concatenating the results in the main process (
rank == 0), but this approach will not work in a distributed scenario with multiple nodes.
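To make the problem concrete, here is a minimal sketch of the per-rank accumulation I am describing; the function names and sample tuples are placeholders, and the distributed machinery is elided:

```python
def evaluate_shard(shard):
    """Each DDP rank runs this on its own portion of the dataset."""
    results = []  # list of dicts -- not tensors, so tensor-based
                  # torch.distributed collectives cannot ship it directly
    for sample_id, label, pred in shard:
        results.append({"id": sample_id, "label": label, "pred": pred})
    return results

def accuracy(all_results):
    """Only meaningful once the per-rank lists have been merged."""
    correct = sum(r["label"] == r["pred"] for r in all_results)
    return correct / len(all_results)

# Two ranks, each seeing half the data (placeholder values):
rank0_results = evaluate_shard([(0, 1, 1), (1, 0, 0)])
rank1_results = evaluate_shard([(2, 1, 0), (3, 1, 1)])

# The open question is how to gather these lists across processes;
# below I simply concatenate them to show what the gather should produce.
merged = rank0_results + rank1_results
print(accuracy(merged))  # -> 0.75
```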
Does anyone know how to gather the lists from all processes so that I can compute the final accuracy per epoch?