Hi,
I am planning to implement a network that does not store the activations/gradients for a couple of layers, and instead recomputes them on the fly during the backward pass. I am currently using convolutional layers from torch.nn. Please let me know how and where I should change the implementation so that activations and gradients are not stored for those layers.
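For reference, here is a minimal sketch of what I have in mind, assuming torch.utils.checkpoint is the right tool for this (the layer shapes and module names are just placeholders). The wrapped block's intermediate activations would not be kept during the forward pass and would be recomputed when backward() reaches it:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Layers whose activations I am willing to recompute instead of store.
        self.block = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1),
            nn.ReLU(),
        )
        # A layer whose activations are stored as usual.
        self.head = nn.Conv2d(16, 10, 1)

    def forward(self, x):
        # checkpoint() runs self.block without saving its intermediate
        # activations; they are recomputed during the backward pass.
        x = checkpoint(self.block, x, use_reentrant=False)
        return self.head(x)


model = CheckpointedConvNet()
inp = torch.randn(2, 3, 32, 32, requires_grad=True)
out = model(inp)
out.sum().backward()
```

Is this the intended way to do it, or should the recomputation be handled differently?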
Yes, this is very relevant, thank you!