Sharing a list in DistributedDataParallel

Hi everyone,
I am stuck on how to handle the evaluation phase with a DistributedDataParallel model. In my evaluation loop I accumulate the correct predictions in order to compute the final accuracy per epoch. These predictions are stored in a list of dictionaries. My model is wrapped in DistributedDataParallel, so each process computes predictions on a separate portion of the dataset.
Unfortunately, the predictions are not tensors, so I cannot use the utilities provided by torch.distributed. I tried saving all the lists to disk and concatenating the results in the main process (rank == 0), but that approach does not work in a multi-node setup where the processes do not share a local filesystem.
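Roughly, the evaluation loop looks like this (simplified sketch; the model, loader, device, and field names are just placeholders):

```python
import torch

# Simplified sketch: each DDP process accumulates its own list of
# prediction dicts over its shard of the validation set.
predictions = []
model.eval()
with torch.no_grad():
    for images, targets, image_ids in val_loader:  # sharded by DistributedSampler
        outputs = model(images.to(device))
        preds = outputs.argmax(dim=1).cpu()
        for img_id, pred, target in zip(image_ids, preds, targets):
            predictions.append({
                "image_id": int(img_id),
                "prediction": int(pred),
                "correct": bool(pred == target),
            })
```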

Do you know how to gather the list from all the different processes in order to compute the final accuracy per epoch?

@Seo You can probably use gather_object to gather non-tensor objects onto a single rank.
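Something along these lines (untested sketch; it assumes the process group is already initialized and the backend implements gather, e.g. gloo):

```python
import torch.distributed as dist

local_predictions = [{"image_id": 0, "correct": True}]  # placeholder per-rank payload

if dist.get_rank() == 0:
    # The destination rank provides a list to receive one object per rank.
    gathered = [None for _ in range(dist.get_world_size())]
    dist.gather_object(local_predictions, gathered, dst=0)
    all_predictions = [p for per_rank in gathered for p in per_rank]
else:
    # Non-destination ranks pass None for the gather list.
    dist.gather_object(local_predictions, None, dst=0)
```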

Hi @pritamdamania87, thank you for your answer. I hadn't noticed gather_object because it is a new feature included in the current 1.8.0 torch nightly. Unfortunately, gather_object doesn't work with the NCCL backend, so I used all_gather_object to gather the results on all processes (and then use them only in the main one). I hope gather_object will be available for the NCCL backend in the next release.
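Concretely, what I am doing now is roughly this (sketch; it assumes the CUDA device for each process is already set, and `predictions` is my per-rank list of dicts):

```python
import torch.distributed as dist

# all_gather_object works with NCCL: every rank receives every other rank's
# list, and I only use the merged result on rank 0.
gathered = [None for _ in range(dist.get_world_size())]
dist.all_gather_object(gathered, predictions)

if dist.get_rank() == 0:
    all_predictions = [p for per_rank in gathered for p in per_rank]
    accuracy = sum(p["correct"] for p in all_predictions) / len(all_predictions)
```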
Thank you!

Looks like gather is not supported as a native NCCL API, although I think we could support gather in PyTorch for the NCCL backend by using ncclSend/ncclRecv under the hood. cc @rvarm1 @Yanli_Zhao
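At the Python level the idea would look roughly like this (sketch only; it assumes a backend that supports point-to-point send/recv and that all tensors have the same shape):

```python
import torch
import torch.distributed as dist

def gather_via_p2p(tensor, dst=0):
    """Emulate gather with point-to-point send/recv (the same idea as using
    ncclSend/ncclRecv under the hood). Returns a list of tensors on `dst`,
    None elsewhere."""
    rank, world_size = dist.get_rank(), dist.get_world_size()
    if rank == dst:
        out = [torch.empty_like(tensor) for _ in range(world_size)]
        out[dst] = tensor
        for src in range(world_size):
            if src != dst:
                dist.recv(out[src], src=src)
        return out
    dist.send(tensor, dst=dst)
    return None
```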


That would be great. For the moment (1.8.0), gather_object still doesn't work with the NCCL backend.