I am using DistributedDataParallel on a single machine with multiple GPUs, and I'm having trouble collecting loss and accuracy across the GPUs.
Each process (GPU) prints the loss and accuracy of its own training, but I want to track the overall loss and accuracy. I tried passing a multiprocessing.RLock as an argument to torch.multiprocessing.spawn, but this fails.
What is the best way to collect the results when training with DistributedDataParallel?
For anyone wondering, I solved this by using a multiprocessing.Manager to share a list between the processes.
In the main function:

```python
from copy import deepcopy

from torch import multiprocessing as mp
from torch.multiprocessing import Manager

with Manager() as manager:
    # a managed list is shared by all spawned processes
    train_results = manager.list()
    # note the trailing comma: args must be a tuple
    mp.spawn(train_worker, nprocs=ngpus, args=(train_results,))
    # copy the data out before the manager shuts down
    results = deepcopy(list(train_results))
# postprocess results to collect data
```
And in the worker (spawn passes the process index as the first argument):

```python
def train_worker(tid, train_data):
    ...
    # loss, acc, time, num_correct come from the training loop
    train_data.append((tid, epoch_num, loss, acc, time, num_correct))
```
Then, after training, you can use pandas to aggregate and analyze the collected data.
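For example, a minimal sketch of that postprocessing step, assuming each worker appended `(tid, epoch, loss, acc, time, num_correct)` tuples as above (the sample values here are made up for illustration):

```python
import pandas as pd

# hypothetical results copied out of the manager.list()
results = [
    (0, 0, 0.9, 0.62, 12.3, 620),
    (1, 0, 0.8, 0.68, 12.1, 680),
    (0, 1, 0.5, 0.81, 12.2, 810),
    (1, 1, 0.4, 0.85, 12.0, 850),
]

df = pd.DataFrame(
    results,
    columns=["tid", "epoch", "loss", "acc", "time", "num_correct"],
)

# mean loss and accuracy across GPUs, per epoch
per_epoch = df.groupby("epoch")[["loss", "acc"]].mean()
print(per_epoch)
```

Grouping by epoch and averaging over the process IDs gives the overall per-epoch loss and accuracy that each individual GPU's printout could not show on its own.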