How To Collect Results in `DistributedDataParallel`

I am using DistributedDataParallel on a single machine with multiple GPUs, and I'm having trouble collecting the loss and accuracy across GPUs.

I have each process (GPU) printing the loss and accuracy for its own portion of the training, but I want to track the overall loss and accuracy. I tried passing a multiprocessing.RLock as an argument to torch.multiprocessing.spawn, but that fails.

What is the best way to collect the results when training with DistributedDataParallel?

For anyone wondering, I solved this by using torch.multiprocessing.Manager.

In the main function:

import torch
from copy import deepcopy
from torch import multiprocessing as mp
from torch.multiprocessing import Manager

ngpus = torch.cuda.device_count()  # one worker process per GPU

with Manager() as manager:
    # shared list that every spawned worker can append to
    train_results = manager.list()
    # spawn one worker per GPU; note the trailing comma, args must be a tuple
    mp.spawn(train_worker, nprocs=ngpus, args=(train_results,))
    # materialize the managed list as a plain list before the manager shuts down
    results = deepcopy(list(train_results))

# postprocess results to collect data

In train_worker:

def train_worker(tid, train_data):
    ...
    # each worker appends one record per epoch with its own local metrics
    train_data.append((tid, epoch_num, loss, acc, time, num_correct))
    ...

Then, after training, you can use pandas to aggregate the per-process results and compute overall statistics.
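
A minimal sketch of that step, assuming `results` is the plain list copied out of the manager above; the column names are just labels chosen for this example:

import pandas as pd

# each tuple appended by a worker: (tid, epoch_num, loss, acc, time, num_correct)
df = pd.DataFrame(
    results,
    columns=["gpu", "epoch", "loss", "acc", "time", "num_correct"],
)

# average loss and accuracy across GPUs for each epoch
per_epoch = df.groupby("epoch")[["loss", "acc"]].mean()
print(per_epoch)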
