Hi!
I am using a nn.parallel.DistributedDataParallel
model for both training and inference on multiple gpu.
To achieve that I use
mp.spawn(evaluate, nprocs=n_gpu, args=(args, eval_dataset))
To evaluate I actually need to first run the dev dataset examples through a model and then to aggregate the results. Therefore I need to be able to return my predictions to the main process (possibly in a dict, but some other data structure should work as well). I’ve tried providing an extra dict argument mp.spawn(evaluate, nprocs=n_gpu, args=(args, eval_dataset, out_dict))
and modifying it in the function but apparently spawn copies it, so the dict in the main process is not modified.
I guess, I could write the results to the file and then read in the main process but it doesn’t seem like the most elegant solution. Is there a better way to return values from spawned functions?
Thanks!
Is there a better way to return values from spawned functions?
If you want to pass the result from spawned processes back to the parent process, you can let the parent process create multiprocessing queues, pass it to children processes, and let children processes send result back through the queue. See the following code:
If the result does not have to go back to the parent process, you can use gather or allgather to communicate the result across children processes.
Hi Arina. What was your solution at the end?
Hi, for me writing to files actually worked out quite alright.
I’m just using pickle.dump
in the spawned processes and pickle.load
in the main one
Actually met this problem once again recently and decided to make use of the mp.Queue as @mrshenli suggested, so I used
result_queue = mp.Queue()
for rank in range(args.n_gpu):
mp.Process(target=get_features, args=(rank, args, eval_dataset, result_queue)).start()
to start the processes instead of mp.spawn and just added the results in my queue with
result_queue.put((preds, labels, ranks))
in those processes.
To later collect the results I did the following:
for _ in range(args.n_gpu):
temp_result = result_queue.get()
preds.append(temp_result[0])
labels.append(temp_result[1])
ranks.append(temp_result[2])
del temp_result