Question about a trick to use when GPU memory overflows

I am training a network, but it is so large that the 12GB of memory on one TITAN Xp GPU can only hold one sample at a time. I am now using a trick to mimic multi-sample training: I run the forward and backward pass with one sample at a time, and after several such passes I call optimizer.step() once. I am not sure whether, in PyTorch, this works similarly to training with multiple samples at once. My network does not use any BatchNorm layers.
Thank you for your advice!

That should generally work.
Note that you might want to scale the gradients (e.g. by dividing the loss by the number of accumulation steps), as they are accumulated (summed) across backward() calls by default.
Here is a good explanation with some examples.
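
As a rough illustration of the accumulation pattern discussed above, here is a minimal sketch. The model, data, and the name `accum_steps` are hypothetical stand-ins (the actual network is not shown in this thread). It assumes a loss with mean reduction, so dividing each single-sample loss by the number of accumulated samples should reproduce the gradient of one larger batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in model and data; not the network from the question.
model = nn.Linear(4, 2)
criterion = nn.MSELoss()  # mean reduction (the default)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(8, 4)
targets = torch.randn(8, 2)


def accumulated_gradient(model, inputs, targets):
    """Gradient built from single-sample passes, each loss scaled by 1/n."""
    model.zero_grad()
    n = inputs.size(0)
    for i in range(n):
        loss = criterion(model(inputs[i:i + 1]), targets[i:i + 1])
        # backward() adds into .grad, so scale to get a mean instead of a sum.
        (loss / n).backward()
    return [p.grad.clone() for p in model.parameters()]


def batch_gradient(model, inputs, targets):
    """Gradient from a single pass over the whole batch, for comparison."""
    model.zero_grad()
    criterion(model(inputs), targets).backward()
    return [p.grad.clone() for p in model.parameters()]


acc_grads = accumulated_gradient(model, inputs, targets)
full_grads = batch_gradient(model, inputs, targets)
grads_match = all(
    torch.allclose(a, b, atol=1e-6) for a, b in zip(acc_grads, full_grads)
)

# Training loop: one sample per forward/backward, one step every accum_steps.
accum_steps = 4
steps_taken = 0
optimizer.zero_grad()
for i in range(inputs.size(0)):
    loss = criterion(model(inputs[i:i + 1]), targets[i:i + 1])
    (loss / accum_steps).backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        steps_taken += 1
```

Without the division, the accumulated gradient would be `accum_steps` times larger than the batch gradient, which acts like an unintentionally larger learning rate.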

Thanks for your reply! Would there be any negative effect if I used BatchNorm layers? If so, what should I do to reduce the impact on performance?

Since you can only use a single sample in your forward pass, the batch statistics would be computed from that one sample only, so the running estimates might be off.
You could try to adjust the momentum, but I think nn.InstanceNorm (e.g. nn.InstanceNorm2d) or nn.GroupNorm would be a better alternative.
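
To make the suggestion above concrete, here is a small sketch comparing the layers on a made-up 4D input. The channel count, group count, and momentum value are arbitrary assumptions for illustration. The point it tries to show is that GroupNorm and InstanceNorm normalize each sample independently, so a sample's output does not depend on what else is in the batch, whereas BatchNorm in training mode does depend on it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

channels = 8  # arbitrary choice for this example

# Normalizes over groups of channels within each sample.
group_norm = nn.GroupNorm(num_groups=4, num_channels=channels)
# Normalizes each channel of each sample separately.
instance_norm = nn.InstanceNorm2d(channels, affine=True)
# A lower momentum makes the running estimates change more slowly,
# which smooths out noisy single-sample statistics (default is 0.1).
batch_norm = nn.BatchNorm2d(channels, momentum=0.01)

batch = torch.randn(4, channels, 16, 16)
single = batch[:1]  # the first sample on its own

# Compare: the first sample processed alone vs. as part of the full batch.
gn_same = torch.allclose(group_norm(single), group_norm(batch)[:1], atol=1e-5)
in_same = torch.allclose(instance_norm(single), instance_norm(batch)[:1], atol=1e-5)
bn_same = torch.allclose(batch_norm(single), batch_norm(batch)[:1], atol=1e-5)

print("GroupNorm batch-independent:", gn_same)
print("InstanceNorm batch-independent:", in_same)
print("BatchNorm batch-independent:", bn_same)
```

Because the per-sample layers behave identically with one sample or many, they fit naturally with the one-sample-at-a-time accumulation described earlier in the thread.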

Ok, thank you!