How to collect outputs from different GPUs, compute a batch-level loss, and do backward

This paper (SimCLR) introduced a self-supervised learning method. In this method, the InfoNCE loss is computed at the batch level from the feature similarities between different inputs. The paper also points out that a larger batch size gives better performance.

For example, I have an 8-GPU machine and I can put 128 images on each GPU. If I use the trivial implementation, I will get eight 128x128 similarity matrices. How could I get one 1024x1024 similarity matrix? I was thinking about using all-reduce, but I am not sure whether the gradient can be passed to all GPUs. Is there a way to implement this? Thanks!

Edit: to make it clearer, suppose we have B images and the features extracted by the model are x1, x2, …, xB. The loss function takes the pairwise dot-product similarities as input. Right now I can only compute the pairwise similarity (128 x 128) on each GPU, sum up the losses from the 8 GPUs, and do backward. I would like to compute the full pairwise similarity (1024 x 1024) and do backward on it directly. How can we do this? Thanks!
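To make the gap concrete, here is a small arithmetic sketch using the numbers from the question (8 GPUs, 128 images each). The variable names are only illustrative. It counts how many pairwise similarities the per-GPU approach covers compared with the full batch.

```python
# Numbers taken from the question; variable names are illustrative.
num_gpus = 8
per_gpu_batch = 128

# Full batch size if all features were pooled together.
global_batch = num_gpus * per_gpu_batch

# Entries covered by eight independent 128x128 matrices.
local_pairs = num_gpus * per_gpu_batch ** 2

# Entries in a single 1024x1024 matrix.
global_pairs = global_batch ** 2

# Fraction of the full matrix that the per-GPU blocks cover.
coverage = local_pairs / global_pairs

print(global_batch, local_pairs, global_pairs, coverage)
```

The eight local matrices correspond only to the diagonal blocks of the full matrix, so the per-GPU loss never sees similarities between images that sit on different GPUs.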

@KaiHoo What is the operation you want to perform across these 8 128x128 matrices? Allreduce will ensure that each GPU ends up with the average (or sum) of the matrices on all the GPUs. There are numerous such collective operations that you can perform to communicate data across GPUs in the distributed package that may be useful (docs here:

Hello, I have made my question clearer. I know the Allreduce op, but I am not sure whether the gradient can pass through this op. Thanks!

Yes, the gradients can be passed to collective operations. You can access the gradient tensors by checking the .grad field of each of the parameters.
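For the 1024x1024 case in the question, one possible approach is to gather the features from every GPU and build the similarity matrix from the concatenated result. The sketch below is a minimal illustration, not a definitive implementation. It assumes PyTorch with `torch.distributed`, a process group that is already initialized, and the same feature shape on every rank. The helper names `gather_features` and `global_similarity` are hypothetical. As far as I know, the tensors filled in by a plain `dist.all_gather` are not attached to the autograd graph, so the sketch puts the local tensor back into its own slot so that gradients can reach the local features.

```python
import torch
import torch.distributed as dist


def gather_features(local_feats: torch.Tensor) -> torch.Tensor:
    """Concatenate features from all ranks (hypothetical helper).

    Gradients flow only into this rank's own slice of the result.
    """
    # Single-process fallback: no process group, so nothing to gather.
    if not (dist.is_available() and dist.is_initialized()):
        return local_feats

    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # One receive buffer per rank. all_gather fills them in place,
    # so they carry no autograd history.
    gathered = [torch.zeros_like(local_feats) for _ in range(world_size)]
    dist.all_gather(gathered, local_feats.detach().contiguous())

    # Re-insert the original, graph-attached tensor in this rank's slot.
    gathered[rank] = local_feats
    return torch.cat(gathered, dim=0)


def global_similarity(local_feats: torch.Tensor) -> torch.Tensor:
    """Pairwise dot-product similarity over the gathered batch."""
    feats = gather_features(local_feats)
    return feats @ feats.t()
```

With 8 GPUs and 128 images per GPU, `global_similarity` would return a 1024x1024 matrix on every rank. Each rank's backward pass then produces gradients only for its own 128 feature rows, so the loss scaling may need care, for example when the model is wrapped in DistributedDataParallel and gradients are averaged across ranks.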