GPU utilization low until validation

Hi,
I am running a DDP training job on 2 GPUs, and every 10 epochs I evaluate on my test data.

For some reason, the first 10 epochs run very slowly, right up until the first evaluation occurs. Can anyone point me to the reason for this?

Attaching a visualization of GPU utilization:

My train/test dataloaders use torch.utils.data.distributed.DistributedSampler as samplers.
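For reference, a minimal sketch of how such samplers are typically wired up in a DDP setup (the dataset and function names here are illustrative, not the actual code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def make_loaders(rank, world_size, batch_size=32):
    # Dummy datasets standing in for the real train/test data.
    train_ds = TensorDataset(torch.randn(1000, 8), torch.randint(0, 2, (1000,)))
    test_ds = TensorDataset(torch.randn(200, 8), torch.randint(0, 2, (200,)))

    # Each rank iterates over a disjoint shard of the data.
    train_sampler = DistributedSampler(train_ds, num_replicas=world_size, rank=rank)
    test_sampler = DistributedSampler(test_ds, num_replicas=world_size, rank=rank, shuffle=False)

    train_loader = DataLoader(train_ds, batch_size=batch_size, sampler=train_sampler)
    test_loader = DataLoader(test_ds, batch_size=batch_size, sampler=test_sampler)
    return train_loader, test_loader
```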

Can you share a minimal reproducible example?

At the moment, my code is a bit too complex to share.
I was hoping someone might be able to point me in the right direction.

Hi,

Solved! Sort of…

Apparently, calling torch.set_num_threads(1) before starting training solves the problem, though I have no idea why.
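In case it helps anyone, this is roughly where the call goes; a sketch assuming a typical per-process DDP entry point (function and argument names are illustrative):

```python
import torch
import torch.distributed as dist

def main(rank, world_size):
    # Limit this worker process to a single intra-op CPU thread.
    # By default each process sizes its OpenMP thread pool to the
    # machine's core count, so with multiple DDP processes per node
    # the pools can oversubscribe the CPU.
    torch.set_num_threads(1)

    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # ... build model, wrap it in DistributedDataParallel, train ...
```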

If anyone has any light to shed on this matter, it would be greatly appreciated!

Thanks

This might be related to OpenMP multithreading: Number of CPU threads for the python process · Issue #16894 · pytorch/pytorch · GitHub

The distributed launcher also does the same thing: pytorch/launch.py at 12f0052eee599ae85c78ccb22c17ae41cc221ff2 · zhaojuanmao/pytorch · GitHub
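To my understanding, the launcher guards against the same CPU oversubscription by defaulting OMP_NUM_THREADS to 1 when it spawns more than one process per node. A simplified sketch of that logic (not the actual source; the function name is mine):

```python
import os

def default_omp_num_threads(nproc_per_node):
    # Simplified sketch of what the distributed launcher does:
    # if the user has not set OMP_NUM_THREADS and more than one
    # process will run per node, default it to 1 so each process's
    # OpenMP pool does not claim every core on the machine.
    if "OMP_NUM_THREADS" not in os.environ and nproc_per_node > 1:
        os.environ["OMP_NUM_THREADS"] = "1"
    return os.environ.get("OMP_NUM_THREADS")
```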

Thank you! I will look into this.