Volatile GPU-Util is 100% and no progress

I have a U-Net with a batch size of 16 and images of shape (512, 512, 3).
My training doesn't progress, and nvidia-smi shows that both of my GPUs are at 100% Volatile GPU-Util. I waited around 30 minutes and not a single batch completed…

What happened?
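In case it helps with debugging this kind of hang: a quick way to check whether batches are actually completing (rather than just being slow) is to time each one with an explicit torch.cuda.synchronize(). This is only a minimal sketch; the tiny Conv2d model, dummy targets, and random batches below are placeholders, not the actual U-Net or data from the post:

```python
import time
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real U-Net, loss, and DataLoader.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1).cuda()
criterion = nn.BCEWithLogitsLoss()
batches = (torch.randn(16, 3, 512, 512) for _ in range(10))

model.train()
for i, images in enumerate(batches):
    start = time.time()
    images = images.cuda(non_blocking=True)
    target = torch.zeros(16, 1, 512, 512, device=images.device)
    loss = criterion(model(images), target)
    loss.backward()
    torch.cuda.synchronize()  # wait until the GPU work for this batch really finishes
    print(f"batch {i}: {time.time() - start:.2f}s, loss {loss.item():.4f}", flush=True)
```

If the prints stop appearing while nvidia-smi still shows 100% utilization, the process is stuck inside a kernel or a synchronization point rather than just training slowly.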


Hello @Oktai15. I also came across a similar problem while training a segmentation network with semantic-segmentation-pytorch. After running for some random number of epochs, Volatile GPU-Util goes to 100% on all GPUs and the training seems to be stuck.

Did you fix the problem or have any thoughts about it?

Hello @Oktai15, I have a similar problem too. Any info about it?

I came across the same problem. Did you solve it?

Same problem today. The script works fine without DDP. When using DDP, all 4 GPUs sit at 100% utilization with no progress (almost no CPU activity either, and the GPUs are only at mid power).
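For the DDP case, this symptom often means the ranks are waiting on each other in an NCCL collective (for example, a rank pinned to the wrong GPU or taking a different code path). Below is a minimal sanity-check sketch, assuming a single node with 4 GPUs launched via torchrun; the Linear model, step count, and file name are placeholders, not your actual training script:

```python
# Run with: torchrun --nproc_per_node=4 ddp_check.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Ask NCCL and DDP for verbose diagnostics before the process group is created.
    os.environ.setdefault("NCCL_DEBUG", "INFO")
    os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)           # pin each rank to its own GPU

    model = DDP(torch.nn.Linear(10, 10).cuda(local_rank), device_ids=[local_rank])

    x = torch.randn(16, 10, device=f"cuda:{local_rank}")
    for step in range(5):
        model(x).sum().backward()  # triggers the gradient all-reduce across ranks
        torch.cuda.synchronize()
        print(f"rank {dist.get_rank()} finished step {step}", flush=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If even this toy script hangs with all GPUs at 100%, the NCCL debug output usually points at the communication setup (device mapping, network interface) rather than the model code.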