I am using DistributedDataParallel on a single machine with multiple GPUs, and I'm having trouble collecting loss and accuracy across the GPUs.
Each process (GPU) prints the loss and accuracy of its own training, but I want to track the overall loss and accuracy. I tried passing a multiprocessing.RLock as an argument to torch.multiprocessing.spawn, but this fails.
What is the best way to collect the results when training with DistributedDataParallel?
For anyone wondering, I solved this by using a multiprocessing.Manager to share a list between the processes.
In the main function:

```python
from copy import deepcopy

from torch import multiprocessing as mp
from torch.multiprocessing import Manager

with Manager() as manager:
    # a managed list is shared by all spawned processes
    train_results = manager.list()
    # note the trailing comma: args must be a tuple
    mp.spawn(train_worker, nprocs=ngpus, args=(train_results,))
    # copy the data out before the manager shuts down
    results = deepcopy(list(train_results))
# postprocess results to collect data
```
And in the worker (spawn passes the process index as the first argument):

```python
def train_worker(tid, train_data):
    ...
    # loss, acc, time, num_correct come from the training loop
    train_data.append((tid, epoch_num, loss, acc, time, num_correct))
```
Then, after training, you can use pandas to aggregate and analyze the collected data.
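For example, a minimal sketch of that postprocessing step, assuming each worker appended `(tid, epoch, loss, acc, time, num_correct)` tuples as above (the sample values here are made up for illustration):

```python
import pandas as pd

# hypothetical results copied out of the manager.list()
results = [
    (0, 0, 0.9, 0.62, 12.3, 620),
    (1, 0, 0.8, 0.68, 12.1, 680),
    (0, 1, 0.5, 0.81, 12.2, 810),
    (1, 1, 0.4, 0.85, 12.0, 850),
]

df = pd.DataFrame(
    results,
    columns=["tid", "epoch", "loss", "acc", "time", "num_correct"],
)

# mean loss and accuracy across GPUs, per epoch
per_epoch = df.groupby("epoch")[["loss", "acc"]].mean()
print(per_epoch)
```

Grouping by epoch and averaging over the process IDs gives the overall per-epoch loss and accuracy that each individual GPU's printout could not show on its own.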