This paper (SimCLR) introduced a self-supervised learning method in which the InfoNCE loss is computed at the batch level from the feature similarities between different inputs. The paper also points out that a larger batch size yields better performance.
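To make the batch-level loss concrete, here is a minimal sketch of the InfoNCE (NT-Xent) computation as I understand it from the paper; the function and variable names are mine, not from the official code:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.5):
    """Batch-level InfoNCE (NT-Xent) over 2B augmented views.

    z1, z2: (B, D) features of two augmentations of the same B images.
    The positive for row i is its other view; the remaining 2B - 2 rows
    in the batch act as negatives, which is why batch size matters.
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2B, D)
    sim = z @ z.t() / temperature                       # (2B, 2B) similarities
    sim.fill_diagonal_(float('-inf'))                   # mask self-similarity
    B = z1.shape[0]
    # Row i's positive sits at index i + B, and vice versa.
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)])
    return F.cross_entropy(sim, targets)
```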

For example, I have an 8-GPU machine and can fit 128 images on each GPU. With the trivial implementation, I get eight 128×128 similarity matrices. How can I get a single 1024×1024 similarity matrix instead? I was thinking about using all-reduce, but I am not sure whether the gradients can flow back to all GPUs. Is there any way to implement this? Thanks!

Edit: to make it clearer, suppose we have B images and the features extracted from the model are x1, x2, …, xB; the loss function takes the pairwise dot-product similarities as input. Currently I can only compute the pairwise similarities (128×128) on each GPU, sum the losses from the 8 GPUs, and run backward. I would like to compute the full pairwise similarity matrix (1024×1024) and run backward on it directly. How can we do this? Thanks!
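For reference, here is roughly what I tried with `torch.distributed.all_gather` (a minimal sketch; the function name is mine). As far as I can tell, the gathered tensors come back detached from the autograd graph, so gradients would only flow through the local 128 features, which is exactly my concern:

```python
import torch
import torch.distributed as dist

def gather_features(x):
    """Naively all_gather per-GPU features into one (world_size*B, D) tensor.

    Problem: all_gather writes into freshly allocated buffers, so the
    gathered copies are detached from the autograd graph -- gradients
    only flow into the local shard x, not the remote ones.
    """
    gathered = [torch.zeros_like(x) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, x)
    return torch.cat(gathered, dim=0)  # concatenated, but detached!
```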