RuntimeError: Function returned nan values in its 0th output

Hi,

I am getting the error below when trying to train my network: “Function ‘’ returned nan values in its 0th output”. I have also enabled “torch.autograd.set_detect_anomaly(True)”.

However, if I use grouped convolution, the network trains completely fine. By the way, I am training on a single GPU, and my understanding is that the network should behave the same with or without grouped convolution on a single GPU (please correct me if I am wrong).
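
For context, here is a minimal sketch of how this kind of error shows up once anomaly detection is enabled. It is a toy reproduction, not the network from this post: the tensor `x` and the `sqrt`-times-zero expression are assumptions chosen only because their backward pass computes 0/0.

```python
import torch

# Enable anomaly detection, as in the post above.
torch.autograd.set_detect_anomaly(True)

# Toy example (hypothetical, not the real model): the forward pass is finite,
# but the backward pass of sqrt at 0 computes 0 / (2 * sqrt(0)) = 0 / 0 = nan.
x = torch.tensor([0.0], requires_grad=True)
y = torch.sqrt(x) * 0.0

message = ""
try:
    y.sum().backward()
except RuntimeError as err:
    # With anomaly detection on, autograd names the backward function
    # that produced the nan instead of silently propagating it.
    message = str(err)

print(message)

# Anomaly detection slows training down, so switch it off after debugging.
torch.autograd.set_detect_anomaly(False)
```

The function named in the message is the backward node that first returned nan, which is usually the best starting point for narrowing down the offending layer or operation.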

I don’t quite understand the issue and don’t know how grouped convolutions fit into your use case.
Did you check why the invalid outputs are created? I.e., is your model’s training exploding?
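
One way to check where invalid values first appear is to register forward hooks and test every module's output with `torch.isfinite`. The following is only a sketch under assumptions: the helper name `find_nonfinite_modules` and the small `nn.Sequential` model are hypothetical stand-ins for your actual network.

```python
import torch
import torch.nn as nn


def find_nonfinite_modules(model: nn.Module, inputs: torch.Tensor) -> list:
    """Run one forward pass and return the names of submodules
    whose output contains NaN or Inf, in execution order."""
    bad = []
    handles = []

    def make_hook(name):
        def hook(module, args, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                bad.append(name)
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module itself
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(inputs)
    finally:
        # Always remove the hooks so they do not slow down later training.
        for handle in handles:
            handle.remove()
    return bad


# Hypothetical toy model; replace it with the real network and a real batch.
model = nn.Sequential(
    nn.Conv2d(4, 8, kernel_size=3, padding=1, groups=2),
    nn.ReLU(),
    nn.Conv2d(8, 8, kernel_size=3, padding=1),
)
x = torch.randn(2, 4, 16, 16)
print(find_nonfinite_modules(model, x))
```

If the list stays empty while the loss still becomes nan, the invalid values are probably created in the loss computation or in the backward pass. In that case, logging the loss and the gradient norm (for example the value returned by `torch.nn.utils.clip_grad_norm_`) at each iteration should show whether training is diverging.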