While replacing DataParallel with DistributedDataParallel, I ran into the following issue: each epoch I compute a number of metrics. With DataParallel this was fine, since everything ran in the same process.
However, with DistributedDataParallel I would still like to collect the metrics on the first GPU. The metrics could be computed either on the GPU or on the CPU. As I understand it, with DistributedDataParallel and 2 GPUs, 3 processes are started (one to collect things). loss.backward() should then work as expected, but how do I do this for my metrics?
Let me answer my own question; I think I have it right now. I store my metrics in a dictionary, e.g. {'batch_metric_0': ...}. Initially these were NumPy arrays, but I converted the code to torch tensors. If that is not possible for your metrics, I assume you could dump them to a pickle and send them as a ByteTensor instead.
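In case someone wants the pickle route, here is a minimal sketch of what I mean (metric_obj and its contents are just a made-up example):

import pickle

import torch

metric_obj = {"f1": 0.87, "support": [12, 30]}  # hypothetical non-tensor metric

# serialize to raw bytes, then wrap them in a uint8 tensor so
# torch.distributed can ship them like any other tensor
buf = pickle.dumps(metric_obj)
byte_tensor = torch.tensor(list(buf), dtype=torch.uint8)

# receiving side: convert the tensor back to bytes and unpickle
restored = pickle.loads(bytes(byte_tensor.tolist()))
assert restored == metric_obj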
Then you can gather them together by iterating over the dictionary, under torch.no_grad():

tensor_names = []
all_tensors = []
with torch.no_grad():
    for k in sorted(tensors_dict.keys()):  # sorted, so every process stacks in the same order
        tensor_names.append(k)
        all_tensors.append(tensors_dict[k])
    all_tensors = torch.stack(all_tensors, dim=0)
Then torch.distributed.reduce(all_tensors, dst=0) sums everything onto rank 0 (the default op is SUM). To turn the sum into an average, divide by the world size, but only on rank 0: the result on the other ranks is not guaranteed to be the full reduction, and we are not interested in those anyway.
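Putting the reduce step together, a sketch assuming the process group is already initialized (variable names match the snippet above, and the metric tensors are assumed to be floating point):

import torch.distributed as dist

# sum the stacked metrics element-wise across all processes onto rank 0
dist.reduce(all_tensors, dst=0, op=dist.ReduceOp.SUM)

if dist.get_rank() == 0:
    # only rank 0 is guaranteed to hold the full sum; average it here
    all_tensors /= dist.get_world_size()
    metrics = dict(zip(tensor_names, all_tensors))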
There is no separate process to collect things. You are responsible for starting all processes, either through running them yourself from a terminal, through torch.distributed.launch, or with some other runner such as mpirun. The typical mode of execution is to use 1 GPU per process.
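For example, with torch.distributed.launch the skeleton of each per-GPU process could look like this (train.py is a hypothetical script name; the launcher passes --local_rank for you):

# launched as: python -m torch.distributed.launch --nproc_per_node=2 train.py
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

# one process per GPU: bind this process to its GPU and join the group
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")

model = torch.nn.Linear(10, 10).cuda()
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])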