Mixing mixed-precision training and full-precision inference in the dataloader

Hi, I have a problem.
So far I had been using mixed precision on the GPU in my training loop with amp.autocast, and everything worked well.
However, I recently started using a PyTorch network in my dataloader (I use RetinaFace to extract landmarks and align my images). RetinaFace runs in full precision on the CPU. Since then, I can’t use mixed precision in my training loop anymore.
I get the error "element 0 of tensors does not require grad and does not have a grad_fn" when I call backward() on my loss. However, when I print the requires_grad attribute of my loss, it is set to True.
I don’t get any error if I run my training loop without mixed precision.
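
For context, the overall setup looks roughly like this (a minimal sketch only, it does not reproduce the error; the placeholder Conv2d stands in for the RetinaFace wrapper, and all names and shapes are made up):

```python
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

class AlignmentDataset(Dataset):
    """Dataset that runs a CPU model on each sample, like the RetinaFace alignment."""
    def __init__(self, images):
        self.images = images
        # Placeholder for the RetinaFace wrapper: a full-precision CPU model.
        self.detector = nn.Conv2d(3, 3, 3, padding=1).eval()

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        with torch.no_grad():
            # Preprocessing runs in full precision on the CPU.
            aligned = self.detector(self.images[idx].unsqueeze(0)).squeeze(0)
        return aligned, torch.tensor(0)

if __name__ == "__main__":
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    criterion = nn.CrossEntropyLoss()
    scaler = torch.cuda.amp.GradScaler()
    loader = DataLoader(AlignmentDataset(torch.randn(8, 3, 32, 32)),
                        batch_size=4, num_workers=2)

    for data, target in loader:
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        # Mixed-precision forward pass; backward goes through the GradScaler.
        with torch.cuda.amp.autocast():
            loss = criterion(model(data), target)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```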

Could you post a minimal, executable code snippet to reproduce the issue, please?

I tried to make a minimal example but I couldn’t reproduce the error.
I have resolved the error, but I don’t know why the fix works.
Instead of initializing the RetinaFace network (actually a class which contains the real neural network, an nn.Module, and loads it in its constructor) in the constructor of my Dataset, I now set it to None there and initialize it while loading the data (I check whether it is still None). Because there are multiple workers, it gets initialized multiple times (I guess the attribute is recreated in each worker process, since it belongs to the Dataset); from the prints it looks like it is initialized twice the number of workers (why twice?).
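
Roughly, the change looks like this (again only a sketch: RetinaFaceWrapper and the placeholder model are illustrative stand-ins for my actual class, which loads its nn.Module in its constructor):

```python
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

class RetinaFaceWrapper:
    """Stand-in for my real wrapper, which loads its nn.Module in the constructor."""
    def __init__(self):
        print("building detector")                        # count how often workers build it
        self.net = nn.Conv2d(3, 3, 3, padding=1).eval()   # placeholder model

    def align(self, img):
        with torch.no_grad():
            return self.net(img.unsqueeze(0)).squeeze(0)

class AlignmentDataset(Dataset):
    def __init__(self, images):
        self.images = images
        # Do not build the detector here: the Dataset is copied into every worker.
        self.detector = None

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # Lazy, per-worker initialization: each worker process builds its own detector.
        if self.detector is None:
            self.detector = RetinaFaceWrapper()
        return self.detector.align(self.images[idx])

if __name__ == "__main__":
    loader = DataLoader(AlignmentDataset(torch.randn(8, 3, 32, 32)),
                        batch_size=4, num_workers=2)
    for batch in loader:
        print(batch.shape)
```

The print in the constructor is what I used to count how many times the detector gets built across the workers.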

So I don’t know whether the issue was actually linked to mixed precision or to the different processing times.