I have a U-Net, batch_size = 16, and images of shape (512, 512, 3).
My training doesn't progress, and nvidia-smi shows both of my GPUs at 100% volatile GPU-util. I waited around 30 minutes and not a single batch completed…
What happened?
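(Not from the thread itself — a generic debugging sketch.) When a training process hangs with GPUs pinned at 100%, a first step is to find out where the Python side is stuck. Assuming you can edit the training script, the standard-library `faulthandler` watchdog will periodically dump every thread's stack, which usually shows whether the process is blocked in the data loader, a lock, or a CUDA/NCCL call:

```python
import faulthandler
import sys

# Watchdog: every 120 s, print all Python thread stacks to stderr.
# If the loop is hung, the repeated dumps show the exact frame it
# is stuck in (data loader worker join, collective op, etc.).
faulthandler.dump_traceback_later(120, repeat=True, file=sys.stderr)

# ... run the training loop here ...

# Cancel the watchdog once training finishes normally.
faulthandler.cancel_dump_traceback_later()
```

If the stacks show the process blocked inside a CUDA call rather than in Python, the hang is on the GPU/communication side and tools like `py-spy dump` or NCCL debug logging are the next step.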
Hello, @Oktai15. I came across a similar problem while training a segmentation network with semantic-segmentation-pytorch. After running for some random number of epochs, volatile GPU-util goes to 100% on every GPU and training appears to be stuck.
Did you fix the problem, or do you have any thoughts about it?
Hello, @Oktai15. I have a similar problem too — any info about it?
I came across the same problem — did you solve it?
Same problem today. The script works fine without DDP, but with DDP all 4 GPUs sit at 100% with no progress (almost no CPU activity either, and the GPUs at mid power).
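(Not confirmed as the cause in this thread — a common DDP hang pattern, sketched for illustration.) With DDP, every rank must execute the same number of collective calls per step. If the dataset does not split evenly across ranks and no padding/dropping sampler is used, one rank runs an extra iteration and blocks forever in its next all-reduce while the others have already moved on. The hypothetical helper below (not a real PyTorch API) shows how batch counts can diverge:

```python
import math

def batches_per_rank(num_samples, world_size, batch_size):
    """Batches each rank would run if samples are split round-robin
    across ranks with no padding or drop_last (hypothetical helper).
    Unequal counts mean some ranks hang waiting in a collective."""
    samples = [num_samples // world_size + (1 if r < num_samples % world_size else 0)
               for r in range(world_size)]
    return [math.ceil(n / batch_size) for n in samples]

# 65 samples, 4 ranks, batch size 16: rank 0 gets 17 samples (2 batches),
# ranks 1-3 get 16 (1 batch each) -> rank 0 blocks in its second all-reduce.
print(batches_per_rank(65, 4, 16))  # → [2, 1, 1, 1]
```

In real PyTorch code this is what `DistributedSampler` (which pads so every rank sees the same number of samples) or the `Join` context manager for uneven inputs is meant to prevent.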