Hi everyone! I have a piece of PyTorch code that runs in a single-machine distributed setting. The code contains all_gather and all_reduce operations to gather predictions from each GPU and to calculate metrics, respectively, without any noticeable slowdown in training speed. I recently added an extra bit of code for something new I'm trying, where I need to broadcast a probability sampled from a uniform distribution (a single number) to all GPUs, so all of them have the same probability value. I added it within the training loop, and it looks like this:
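(Roughly, as a simplified sketch — this assumes the process group is already initialized with a CUDA-capable backend like NCCL and that rank 0 is the source; the variable names are just for illustration:)

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()

# Rank 0 samples the probability from U(0, 1); the other ranks
# allocate a placeholder tensor that broadcast() fills in place.
if rank == 0:
    prob = torch.rand(1).cuda()
else:
    prob = torch.zeros(1).cuda()

# After this call, every rank holds rank 0's sampled value.
dist.broadcast(prob, src=0)
```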
Hi David! Thank you for your comment. I also thought of that, but the PyTorch documentation got me confused. In particular, for the device argument in torch.Tensor.cuda — PyTorch 1.9.0 documentation, it says: "device (torch.device) – The destination GPU device. Defaults to the current CUDA device."
So I thought that if you don't specify it, the default would be the current CUDA device rather than always 'cuda:0'?
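For example, if I'm reading the docs right, something like the sketch below should put the tensor on each process's own GPU rather than always on cuda:0 (this assumes one process per GPU and an initialized process group; I haven't verified it on my setup):

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()       # e.g. 0..3 on a 4-GPU machine
torch.cuda.set_device(rank)  # current CUDA device is now cuda:<rank>

# With no device argument, .cuda() targets the *current* device,
# which is cuda:0 only if set_device() was never called.
t = torch.rand(1).cuda()
print(t.device)              # cuda:<rank>, not necessarily cuda:0
```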