Torch.distributed.all_gather() to compute Noise Contrastive Loss with PyTorch

Computing the InfoNCE loss requires gathering the encoded representations from all GPUs so that the full set of negatives is available. Specifically, many repositories (e.g. SimCLR, essl) 1) call torch.distributed.all_gather() to collect features from all GPUs in forward(), and 2) use torch.distributed.all_reduce() to sum the gradients in backward().
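For concreteness, the gather step is typically wrapped in a small autograd function along the lines of the sketch below, since a plain all_gather() does not propagate gradients. This is a minimal illustration of the pattern, not code copied from any particular repository:

```python
import torch
import torch.distributed as dist

class GatherLayer(torch.autograd.Function):
    """All-gather features from every rank while keeping the autograd graph,
    so gradients can flow back to the local features on each GPU."""

    @staticmethod
    def forward(ctx, x):
        out = [torch.zeros_like(x) for _ in range(dist.get_world_size())]
        dist.all_gather(out, x)          # collect features from all ranks
        return tuple(out)

    @staticmethod
    def backward(ctx, *grads):
        all_grads = torch.stack(grads)
        dist.all_reduce(all_grads)       # sum gradient contributions across ranks
        return all_grads[dist.get_rank()]  # return the slice for this rank's input

# usage: z_all = torch.cat(GatherLayer.apply(z_local), dim=0)
```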

However, wouldn’t the synchronized representations (obtained via torch.distributed.all_gather()) make every GPU compute exactly the same loss, so that the computation is redundant?

How about sending all features to a single device (either a GPU or the CPU), computing the loss there, and multiplying the gradient by the number of GPUs, since the gradients are averaged across GPUs anyway?

That’s a fair concern. I don’t know if the program uses DDP. With DDP, since the data fed to each GPU is different, wouldn’t the loss computed on each GPU also be different, and hence the need to all-reduce the gradients?

The inputs to the GPUs are of course different. However, I think the representations become identical across GPUs after all_gather() (most InfoNCE self-supervised repositories synchronize the representations so that every representation can be used as a negative sample). Therefore, I assume gathering all representations on a single device, e.g. one GPU or the CPU, could avoid the redundant loss computation.

My question is simply whether this concern is correct. Thank you!

I would agree with the concern. This can be done a little differently, so that each GPU only calculates the loss for its own batch after all_gather. That way every GPU does a similar amount of work, and the gradient all-reduce still carries the result to all GPUs; see the sketch below.
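A minimal sketch of that idea, assuming a two-view setup where `q_local` holds this rank's query features and `k_gathered` holds key features gathered from all ranks with a gradient-preserving all_gather (e.g. the GatherLayer sketch above). The function and variable names here are illustrative, not from a specific repository:

```python
import torch
import torch.nn.functional as F
import torch.distributed as dist

def local_info_nce(q_local, k_gathered, temperature=0.1):
    """InfoNCE where each rank computes only its own rows of the similarity matrix.

    q_local:    (B, D) query features produced on this rank (requires grad).
    k_gathered: (world_size * B, D) key features from all ranks, collected with a
                gradient-preserving all_gather.
    Assumes the positive key for local query i sits at row rank * B + i of
    k_gathered -- adjust the target indexing to your own pairing scheme.
    """
    q_local = F.normalize(q_local, dim=1)
    k_gathered = F.normalize(k_gathered, dim=1)

    logits = q_local @ k_gathered.t() / temperature     # (B, world_size * B)
    rank, B = dist.get_rank(), q_local.size(0)
    targets = torch.arange(B, device=q_local.device) + rank * B
    return F.cross_entropy(logits, targets)
```

Since each rank only evaluates its own slice of the loss, whether an extra scaling by the world size is needed depends on how DDP averages gradients and how the gather layer reduces them in backward; that bookkeeping is exactly what the repository mentioned below works through.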

zsnoob/EfficientDDP-4-Contrastive-Train: Optimizing the way of contrastive learning in PyTorch-DDP (DistributedDataParallel) multi-GPU training (github.com) — there is a GitHub repository about exactly this question, with some optimizations addressing the concern that “all machines are doing the same thing in the loss calculation”. Let’s discuss it.