I have a U-Net, batch_size = 16, and images of shape (512, 512, 3).
My training doesn't progress, and nvidia-smi shows both of my GPUs at 100% volatile GPU-util. I waited around 30 minutes and not a single batch completed…
What happened?
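(Not from the thread itself — a generic debugging sketch.) When a training process hangs with GPUs pinned at 100%, a first step is to find out where the Python side is stuck. Assuming you can edit the training script, the standard-library `faulthandler` watchdog will periodically dump every thread's stack, which usually shows whether the process is blocked in the data loader, a lock, or a CUDA/NCCL call:

```python
import faulthandler
import sys

# Watchdog: every 120 s, print all Python thread stacks to stderr.
# If the loop is hung, the repeated dumps show the exact frame it
# is stuck in (data loader worker join, collective op, etc.).
faulthandler.dump_traceback_later(120, repeat=True, file=sys.stderr)

# ... run the training loop here ...

# Cancel the watchdog once training finishes normally.
faulthandler.cancel_dump_traceback_later()
```

If the stacks show the process blocked inside a CUDA call rather than in Python, the hang is on the GPU/communication side and tools like `py-spy dump` or NCCL debug logging are the next step.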
Hello, @Oktai15. I came across a similar problem while training a segmentation network with semantic-segmentation-pytorch. After running for some random number of epochs, volatile GPU-util goes to 100% on every GPU and training appears to be stuck.
Did you fix the problem, or do you have any thoughts about it?
Hello, @Oktai15. I have a similar problem too — any info about it?
I came across the same problem — did you solve it?
Same problem today. The script works fine without DDP, but with DDP all 4 GPUs sit at 100% with no progress (almost no CPU activity either, and the GPUs at mid power).
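(Not confirmed as the cause in this thread — a common DDP hang pattern, sketched for illustration.) With DDP, every rank must execute the same number of collective calls per step. If the dataset does not split evenly across ranks and no padding/dropping sampler is used, one rank runs an extra iteration and blocks forever in its next all-reduce while the others have already moved on. The hypothetical helper below (not a real PyTorch API) shows how batch counts can diverge:

```python
import math

def batches_per_rank(num_samples, world_size, batch_size):
    """Batches each rank would run if samples are split round-robin
    across ranks with no padding or drop_last (hypothetical helper).
    Unequal counts mean some ranks hang waiting in a collective."""
    samples = [num_samples // world_size + (1 if r < num_samples % world_size else 0)
               for r in range(world_size)]
    return [math.ceil(n / batch_size) for n in samples]

# 65 samples, 4 ranks, batch size 16: rank 0 gets 17 samples (2 batches),
# ranks 1-3 get 16 (1 batch each) -> rank 0 blocks in its second all-reduce.
print(batches_per_rank(65, 4, 16))  # → [2, 1, 1, 1]
```

In real PyTorch code this is what `DistributedSampler` (which pads so every rank sees the same number of samples) or the `Join` context manager for uneven inputs is meant to prevent.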