Reducing metrics with DistributedDataParallel

While replacing DataParallel with DistributedDataParallel, I ran into the issue that I compute a number of metrics each epoch. With DataParallel this was still fine, since everything ran in the same process.

However, I would also like to collect the metrics on the first GPU. The metrics could be computed either on the GPU or on the CPU. As I understand it, with DistributedDataParallel and 2 GPUs, 3 processes are started (one of them to collect things). loss.backward() should then work as expected, but how do I do the same for my metrics?


Let me answer my own question; I think I have it right now. I store my metrics in a dictionary like {'batch_metric_0': ...}. Initially these were numpy arrays, but I converted the code to torch. If that is not possible, I assume you could instead dump the objects to a pickle and use a ByteTensor.
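For concreteness, a minimal sketch of such a dictionary (the metric names and values are made up):

import torch

# Hypothetical per-batch metrics, converted from numpy to torch scalars on this GPU
tensors_dict = {
    'batch_metric_0': torch.tensor(0.75, device='cuda'),
    'batch_metric_1': torch.tensor(1.32, device='cuda'),
}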

Then you can collect them by iterating over the dictionary (this can be done under torch.no_grad()):

tensor_names, all_tensors = [], []
for k in sorted(tensors_dict.keys()):  # sorted so every process uses the same order
    tensor_names.append(k)
    all_tensors.append(tensors_dict[k])
all_tensors = torch.stack(all_tensors, dim=0)

Then torch.distributed.reduce(all_tensors, dst=0) sums everything onto rank 0. Finally, you need to divide by the world size on rank 0 only (the tensors on the other ranks do not hold the reduced result, and we are not interested in them).
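Putting the pieces together, here is a rough end-to-end sketch (the helper name is mine; it assumes the process group is already initialized and that tensors_dict holds scalar tensors on the device expected by your backend, e.g. the current GPU for NCCL):

import torch
import torch.distributed as dist

@torch.no_grad()
def reduce_metrics(tensors_dict, dst=0):
    # Stack in a fixed (sorted) order so every rank sends the same layout.
    tensor_names = sorted(tensors_dict.keys())
    all_tensors = torch.stack([tensors_dict[k] for k in tensor_names], dim=0)

    # Default reduce op is SUM; only the dst rank ends up with the full result.
    dist.reduce(all_tensors, dst=dst)

    if dist.get_rank() == dst:
        all_tensors /= dist.get_world_size()  # turn the sum into an average
        return dict(zip(tensor_names, all_tensors))
    return None

Every rank has to call reduce_metrics, but only rank 0 gets the averaged dictionary back.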

Hope this is helpful to someone.


There is no separate process to collect things. You are responsible for starting all processes, either through running them yourself from a terminal, through torch.distributed.launch, or with some other runner such as mpirun. The typical mode of execution is to use 1 GPU per process.
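For example, with torch.distributed.launch one process is started per GPU and each process sets up distributed communication itself. A minimal sketch, assuming the script is started as python -m torch.distributed.launch --nproc_per_node=2 train.py (the script name is an example):

import argparse
import torch
import torch.distributed as dist

# torch.distributed.launch passes --local_rank to each process and sets
# MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE in the environment.
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

dist.init_process_group(backend='nccl', init_method='env://')
torch.cuda.set_device(args.local_rank)  # one GPU per process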

Great solution! Thanks for reporting back and sharing it.