Async/Parallel Evaluation of model

I have a machine with 10 GPUs and my utilization is quite bad (<50% of the time is spent training) because of the metrics I have to compute each epoch: they are quite involved, and not all of them can be easily parallelized across 10 GPUs. I have already reduced the resolution for many metrics; they are very important to me and I can't reduce them further. For some metrics I also generate matplotlib plots, which is quite costly.

I am thinking that switching from a purely sequential setup (train -> eval -> train -> eval …) to a parallel one would greatly speed things up (for example, 8-9 GPUs constantly occupied with training and 1-2 with evaluating my metrics). Is this possible with PyTorch? The only examples I've found are about parallelizing the training, but that part is already working.

I would have to clone my model and push it to my evaluation workers.

If you push a copy of the current model to GPU9 and execute the evaluation method, it should run asynchronously while your other GPUs are training.
Note that your data loading might become a bottleneck if multiple DataLoaders are now trying to read from your drive at the same time.
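A minimal sketch of that suggestion. The `nn.Linear` stands in for the real model and the random tensor for the real validation data; `cuda:9` is assumed to be the spare evaluation GPU, with a CPU fallback so the sketch also runs on machines without 10 GPUs:

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the real model

# Use GPU 9 as the evaluation device if it exists; otherwise fall back to CPU.
eval_device = torch.device("cuda:9" if torch.cuda.device_count() > 9 else "cpu")

# Snapshot the current weights and move the copy to the eval device.
# deepcopy decouples the copy from training, so the training GPUs can
# keep updating the original while this copy is being evaluated.
eval_model = copy.deepcopy(model).to(eval_device).eval()

with torch.no_grad():
    x = torch.randn(8, 4, device=eval_device)  # stand-in for validation data
    out = eval_model(x)

print(out.shape)  # torch.Size([8, 2])
```

Since the copy lives on its own device, the kernels it launches there run concurrently with the training kernels on the other GPUs; only the `deepcopy` and the host-side Python code are serialized with the training loop.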

Ok, I’ll test it.

Could this be implemented using torch.multiprocessing, or what would you recommend?
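One possible shape for a torch.multiprocessing version, sketched under assumptions: a background process receives CPU copies of the `state_dict` through a queue and computes the metrics, while the main process goes straight back to training. The tiny `nn.Linear` and the squared-mean "metric" are placeholders for the real model and evaluation code:

```python
import torch
import torch.multiprocessing as mp

def eval_worker(task_queue, result_queue):
    # Runs in a separate process: pull (epoch, state_dict) pairs and
    # compute metrics while the main process keeps training.
    while True:
        item = task_queue.get()
        if item is None:  # sentinel: shut down
            break
        epoch, state = item
        model = torch.nn.Linear(4, 2)  # hypothetical model architecture
        model.load_state_dict(state)
        model.eval()
        with torch.no_grad():
            x = torch.randn(16, 4)
            metric = model(x).pow(2).mean().item()  # stand-in metric
        result_queue.put((epoch, metric))

# "fork" keeps the sketch simple; use "spawn" (with a __main__ guard)
# if you pass CUDA tensors between processes.
ctx = mp.get_context("fork")
tasks, results = ctx.Queue(), ctx.Queue()
worker = ctx.Process(target=eval_worker, args=(tasks, results), daemon=True)
worker.start()

model = torch.nn.Linear(4, 2)
for epoch in range(2):
    # ... training steps on the main GPUs would go here ...
    cpu_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}
    tasks.put((epoch, cpu_state))  # non-blocking: training continues immediately

tasks.put(None)
metrics = sorted(results.get() for _ in range(2))
worker.join()
print(metrics)
```

Copying the weights to CPU before enqueueing avoids sharing CUDA tensors across processes; the matplotlib plotting could live in the same worker, which also keeps it off the training process.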

CUDA calls should run asynchronously by default.
Let me know if you encounter any issues.
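A quick way to observe this asynchrony (timings are illustrative; on a machine without CUDA the sketch just falls back to a message):

```python
import time
import torch

if torch.cuda.is_available():
    x = torch.randn(4096, 4096, device="cuda")
    torch.cuda.synchronize()           # make sure the GPU is idle first
    t0 = time.perf_counter()
    y = x @ x                          # kernel is only queued here
    t_queued = time.perf_counter() - t0
    torch.cuda.synchronize()           # block until the kernel has finished
    t_done = time.perf_counter() - t0
    # t_queued is typically much smaller than t_done: the Python call
    # returned before the matmul actually ran on the GPU.
    print(f"call returned after {t_queued:.6f}s, result ready after {t_done:.6f}s")
else:
    t_queued = t_done = 0.0
    print("no CUDA device available; CPU ops run synchronously")
```

This is why the evaluation on the spare GPU can overlap with training without any explicit threading: each device processes its own kernel queue independently, and Python only blocks when a result is actually needed on the host.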