High GPU memory usage but low volatile GPU-Util


I am running a model implemented in PyTorch on four GPUs; the GPU memory usage is up to 80% while the volatile GPU-Util is very low.

When debugging, all variables are on the GPUs, so I wonder whether anyone could tell me what in the code might be causing this?

Any help is more than welcome.


Your system is probably having trouble supplying data fast enough to saturate the compute on all 4 GPUs. How many workers are you using for the dataloader?
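A minimal sketch of what "more workers" looks like, using a synthetic stand-in dataset (the dataset and shapes here are made up for illustration; only `num_workers` and `pin_memory` are the point):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 1000 fake image-like samples of shape (3, 32, 32)
dataset = TensorDataset(torch.randn(1000, 3, 32, 32),
                        torch.randint(0, 10, (1000,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,    # 4 worker processes prefetch batches in parallel
    pin_memory=True,  # page-locked host memory speeds up host-to-GPU copies
)

for images, labels in loader:
    # With pin_memory=True, non_blocking=True lets the copy overlap compute:
    # images = images.cuda(non_blocking=True)  # uncomment on a GPU machine
    pass
```

With too few workers the GPUs idle while the CPU decodes and augments the next batch, which is exactly the low-utilization pattern described above.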


I’m having a similar issue – using 8 workers in torch.utils.data.DataLoader, and volatile GPU-Util jumps between 0% and 80% roughly every half second. Anything I can do to improve this? I can send code if it’d be helpful.



I also have a similar problem. The worse thing in my case is that not only is the volatile GPU-Util low, but the memory usage is low as well.

I also have the same problem. Is the calculation inefficient?

I guess your data processing isn’t fast enough to keep up with the computation.

Try increasing the number of workers in your DataLoader; it may improve GPU utilization.
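One way to confirm this diagnosis is to time how long each iteration waits on the loader versus how long it computes. This is a hypothetical diagnostic sketch (the trivial `mean()` stands in for a real forward/backward pass); if `data_time` dominates, the bottleneck is data loading, not the GPU:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(512, 3, 32, 32),
                        torch.randint(0, 10, (512,)))
loader = DataLoader(dataset, batch_size=64, num_workers=2)

data_time = 0.0
compute_time = 0.0
end = time.perf_counter()
for images, labels in loader:
    data_time += time.perf_counter() - end       # time spent waiting for the batch
    start = time.perf_counter()
    _ = images.mean()                            # stand-in for the training step
    compute_time += time.perf_counter() - start
    end = time.perf_counter()

print(f"data: {data_time:.3f}s  compute: {compute_time:.3f}s")
```

If the data side dominates, raising `num_workers` (or caching preprocessed data) should help; if compute dominates, the loader is not the problem.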

I met the same problem with only one GPU (NVIDIA GTX 1080 Ti) while training on the ImageNet dataset with the ImageFolder API.

In my case the dataset images are on a Seagate 2 TB HDD, so the problem may be partly caused by the slow read speed of the HDD. Moving the dataset to an SSD would probably help a lot.

However, I found it got better when setting shuffle=False and using a larger num_workers, though the problem still happened sometimes. I think the shuffling in PyTorch may also be a bottleneck.

You can use a new dataloader, or store the preprocessed dataset and load it directly instead of preprocessing on the fly. With ImageNet and VGG16, this worked for me.
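The "store the preprocessed dataset" idea can be sketched as follows: run the expensive transforms once, save the resulting tensors with `torch.save`, then train from the cached file every epoch. All names, shapes, and the `preprocess` function below are hypothetical stand-ins, not the poster's actual code:

```python
import os
import tempfile
import torch
from torch.utils.data import DataLoader, TensorDataset

def preprocess(raw):
    # Stand-in for expensive per-image transforms (decode, resize, normalize, ...)
    return (raw - raw.mean()) / (raw.std() + 1e-8)

raw_images = torch.randn(256, 3, 32, 32)
labels = torch.randint(0, 10, (256,))

# Preprocess once and cache the tensors to disk.
cache_path = os.path.join(tempfile.gettempdir(), "preprocessed.pt")
torch.save({"images": preprocess(raw_images), "labels": labels}, cache_path)

# Later (or in every epoch), load the cached tensors directly;
# no per-sample CPU work is left for the training loop.
cached = torch.load(cache_path)
loader = DataLoader(TensorDataset(cached["images"], cached["labels"]),
                    batch_size=32)
```

The trade-off is disk space and a fixed set of transforms: anything cached this way (e.g. random augmentations) is frozen at preprocessing time.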