Training multiple PyTorch models concurrently leads to longer training time for each model

I’m currently training two variations of the following model (GitHub - TengdaHan/DPC: Video Representation Learning by Dense Predictive Coding. Tengda Han, Weidi Xie, Andrew Zisserman.).

I’ve noticed that when training both models concurrently (each on a dedicated GPU), the training time for each model increases by ~25% compared to training only one model at a time.

Currently, GPU utilization is >90% for both models. CPU usage is ~60% and RAM usage is ~25% when both models are trained concurrently.

Here are the profiling results (sorted by total CPU time, total CUDA time, CPU memory usage, and CUDA memory usage).
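For reference, the tables were produced with something along these lines (a minimal sketch using `torch.profiler` with a placeholder model and input, not the actual DPC training code, and assuming a CUDA device is available):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input; the real runs profiled the DPC model,
# each process on its own dedicated GPU.
device = "cuda"
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(64, 64, 3, padding=1)).to(device)
x = torch.randn(8, 3, 128, 128, device=device)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof:
    model(x).sum().backward()

# One table per sort key, matching the four orderings listed above.
for key in ("cpu_time_total", "cuda_time_total",
            "cpu_memory_usage", "cuda_memory_usage"):
    print(prof.key_averages().table(sort_by=key, row_limit=15))
```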

Your screenshots are a bit hard to decipher, but based on the values I could read, it seems the CUDA operations take approximately the same amount of time, while the CPU ops are a bit slower.
This could point towards a CPU-bound workload: the host might not be fast enough to run ahead and schedule the GPU work when you start multiple processes.
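If it helps, here is a rough way to check this (a sketch under the assumption of a single GPU and a dummy model, not taken from your code): compare the host-side time needed just to launch one step's kernels with the time including the GPU execution. If the two are close, the GPU is mostly waiting for the host.

```python
import time
import torch
import torch.nn as nn

# Dummy model and input standing in for one training process.
device = "cuda"
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(64, 64, 3, padding=1)).to(device)
x = torch.randn(16, 3, 128, 128, device=device)

# Warm-up so cuDNN autotuning and lazy init don't distort the numbers.
for _ in range(5):
    model(x).sum().backward()
torch.cuda.synchronize()

t0 = time.perf_counter()
model(x).sum().backward()
launch = time.perf_counter() - t0   # host-side launch/scheduling time only
torch.cuda.synchronize()
total = time.perf_counter() - t0    # launch time + GPU execution time

print(f"launch: {launch * 1e3:.1f} ms, total: {total * 1e3:.1f} ms")
# If launch is close to total, the CPU is the bottleneck: the GPU finishes
# almost as soon as the host stops feeding it work, which would match the
# slowdown you see when two processes share the same host.
```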

Thank you for your reply! I’ve updated the screenshot to a higher resolution version.