How to gather a list of strings from different rank machines in DDP mode with NCCL backend?

Hi,
I’ve recently been accelerating model training with DistributedDataParallel, using NCCL as the torch.distributed backend. For validation I need to work with a list of strings held in memory, but with the multi-process setup it’s much harder to share that list across ranks than it was in DP mode. Is there a good way to solve this?

There is a PR to provide such a feature for general Python objects, but it hasn’t landed yet. You can copy that code for now.
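
In case it helps, here is a minimal sketch of the same idea that PR uses: pickle the local list into a byte tensor, pad to a common size (NCCL’s all_gather requires equal-sized tensors on every rank), all_gather the buffers, then trim and unpickle each rank’s contribution. The helper name `all_gather_strings` and the device handling are my own choices for illustration, not taken from the PR:

```python
import pickle

import torch
import torch.distributed as dist


def all_gather_strings(local_strings, device):
    """Gather every rank's list of strings onto all ranks.

    Hypothetical helper; assumes init_process_group(backend="nccl")
    has already been called and `device` is this rank's CUDA device.
    """
    world_size = dist.get_world_size()

    # Serialize the local list into a CUDA uint8 tensor
    # (NCCL only moves CUDA tensors).
    payload = torch.tensor(
        list(pickle.dumps(local_strings)), dtype=torch.uint8, device=device
    )

    # Exchange payload sizes first, since NCCL all_gather needs
    # equal-sized tensors across ranks.
    local_size = torch.tensor([payload.numel()], dtype=torch.long, device=device)
    sizes = [torch.zeros_like(local_size) for _ in range(world_size)]
    dist.all_gather(sizes, local_size)
    max_size = max(int(s.item()) for s in sizes)

    # Pad to the maximum size, gather, then trim each buffer back to
    # its true length and unpickle.
    padded = torch.cat([payload, payload.new_zeros(max_size - payload.numel())])
    gathered = [
        torch.empty(max_size, dtype=torch.uint8, device=device)
        for _ in range(world_size)
    ]
    dist.all_gather(gathered, padded)

    merged = []
    for tensor, size in zip(gathered, sizes):
        raw = bytes(tensor[: int(size.item())].cpu().tolist())
        merged.extend(pickle.loads(raw))
    return merged
```

Called as `all_gather_strings(my_strings, torch.device("cuda", local_rank))` on every rank, it returns the concatenation of all ranks’ lists on each rank, which should be enough for validation over strings.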

Thanks for your reply, @mrshenli. This design is interesting.