I have a machine with 10 GPUs and my utilization is quite bad (less than 50% of the time is spent training) because of the metrics I have to compute each epoch: they are quite involved, and not everything can be easily parallelized across 10 GPUs. I have already reduced the resolution for many metrics; they are very important to me and I can't reduce them further. For some metrics I also generate matplotlib plots, which is costly as well.
I am thinking that switching from a purely sequential setup (train -> eval -> train -> eval ...) to a parallel one would greatly improve throughput (for example, 8-9 GPUs constantly busy with training and 1-2 with evaluating my metrics). Is this possible with PyTorch? The only examples I've found are about parallelizing the training itself, but that part is already working.
I would have to clone my model and push it to my evaluation workers.
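To make the question concrete, here is a minimal sketch of the kind of setup I have in mind, using `torch.multiprocessing`: the training process snapshots the weights to CPU and pushes them through a queue to a separate evaluation process. The `nn.Linear` model, the `snapshot_state` helper, and the placeholder metric are just stand-ins, not my actual code.

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn

def snapshot_state(model):
    """Copy weights to CPU so the eval process never touches the training GPUs."""
    return {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}

def eval_worker(queue):
    """Runs in its own process; evaluates each snapshot it receives."""
    model = nn.Linear(4, 2)  # must match the training architecture
    while True:
        item = queue.get()
        if item is None:  # sentinel: training is finished
            break
        epoch, state = item
        model.load_state_dict(state)
        with torch.no_grad():
            # placeholder metric; the expensive metrics / matplotlib plots go here
            metric = model(torch.ones(1, 4)).sum().item()
            print(f"epoch {epoch}: metric={metric:.4f}")

def train(num_epochs=3):
    queue = mp.Queue(maxsize=2)  # bounded, so stale snapshots cannot pile up
    worker = mp.Process(target=eval_worker, args=(queue,))
    worker.start()
    model = nn.Linear(4, 2)
    for epoch in range(num_epochs):
        # ... training steps on the 8-9 training GPUs would go here ...
        queue.put((epoch, snapshot_state(model)))  # hand off to the eval process
    queue.put(None)  # tell the worker to shut down
    worker.join()

if __name__ == "__main__":
    # with CUDA tensors involved, subprocesses must use the "spawn" start method
    mp.set_start_method("spawn", force=True)
    train()
```

One design question this raises: with a bounded queue, training blocks once evaluation falls too far behind; with an unbounded one, snapshots accumulate in memory. Is there an established PyTorch pattern for this, or is a hand-rolled producer/consumer like the above the usual approach?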