I am having problems using the checkpoint_sequential function. Following the documentation, I define my model as:
model = nn.Sequential(...)
input_var = checkpoint_sequential(model, chunks, input_var)
But when I start training I get the following error:
“None of the inputs have requires_grad=True. Gradients will be None”
I am guessing what may be happening is that it tries to checkpoint the input, which would not have a gradient. Is this a bug? Do I need to worry about it? Is there a workaround?
There was a discussion about this recently, but with no apparent resolution:
@ptrblck Thanks for the quick reply! I linked that thread in my OP, but I don't think there are any solutions there for checkpoint_sequential, only for checkpoint.
Is this anything more than a cosmetic issue? Does it break my backpropagation, or does it just complain that my input has no gradients, so it cannot checkpoint something it shouldn't be checkpointing in the first place?
The error message is unclear:
“UserWarning: None of the inputs have requires_grad=True. Gradients will be None”
Does this apply to just the one layer (the input), which I don't care about, or to all subsequent layers?
I assume no gradients will be calculated at all.
You could run a quick test using a constant tensor, perform an update step, and compare the new output using the same input tensor. Both outputs should be equal, if the model was not updated.
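A minimal sketch of that test might look like the following. The toy model, the SGD optimizer, and the learning rate are hypothetical stand-ins, and passing use_reentrant=True assumes a recent PyTorch version in which checkpoint_sequential accepts that argument (it selects the legacy implementation that emits the warning).

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

torch.manual_seed(0)

# Hypothetical toy model; substitute your own nn.Sequential.
model = nn.Sequential(
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Constant input; requires_grad is False, as in the original question.
x = torch.ones(4, 8)


def forward(inp):
    # Assumption: this PyTorch version accepts use_reentrant here.
    return checkpoint_sequential(model, 2, inp, use_reentrant=True)


out_before = forward(x).detach().clone()

# One update step.
optimizer.zero_grad()
forward(x).sum().backward()
optimizer.step()

out_after = forward(x).detach().clone()

# If the outputs are equal, the update step did not change the model.
outputs_equal = torch.equal(out_before, out_after)

# Parameters whose .grad is still None received no gradient at all.
params_without_grad = [
    name for name, p in model.named_parameters() if p.grad is None
]

print("outputs equal after step:", outputs_equal)
print("parameters without gradient:", params_without_grad)
```

Listing the parameters whose .grad is None shows which layers were actually skipped, which is more informative than the output comparison alone.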
The simple workaround for the nn.Sequential container would be to set the requires_grad attribute of the input to True, as the dummy input approach would be a bit more complicated.
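As a rough sketch of that workaround, assuming a small hypothetical model and a PyTorch version in which checkpoint_sequential accepts use_reentrant:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Hypothetical toy model and sizes, for illustration only.
model = nn.Sequential(
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
)
chunks = 2
input_var = torch.randn(4, 8)

# The workaround: mark the input as requiring gradients before checkpointing.
input_var.requires_grad_(True)

# Assumption: this PyTorch version accepts use_reentrant here.
out = checkpoint_sequential(model, chunks, input_var, use_reentrant=True)
out.sum().backward()

# Every parameter should now have a gradient.
all_have_grads = all(p.grad is not None for p in model.parameters())

# Side effect: a gradient the size of the input is stored as well.
input_grad_bytes = input_var.grad.numel() * input_var.grad.element_size()

print("all parameters have gradients:", all_have_grads)
print("extra bytes held by input gradient:", input_grad_bytes)
```

The extra cost is one gradient tensor with the same shape and dtype as the input.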
It is true that setting input.requires_grad=True makes the warning go away and backprop work again. However, calculating and storing gradients for the input should not be necessary, since standard backprop doesn't need them, and it would also add a huge memory requirement, defeating the whole purpose of using checkpoint_sequential in the first place.
Based on the linked tutorial from the other post it seems to be a known limitation.
How large is your input, that it would be a huge waste of memory?
E.g. an input tensor of [64, 3, 224, 224] in float32 would need ~36.75MB.
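The arithmetic behind that figure can be checked in plain Python. It assumes 4 bytes per float32 element and reports the size in binary megabytes (MiB).

```python
import math

shape = (64, 3, 224, 224)
bytes_per_element = 4  # float32

num_elements = math.prod(shape)
size_bytes = num_elements * bytes_per_element
size_mib = size_bytes / 1024**2

print(f"{num_elements} elements, {size_bytes} bytes, {size_mib:.2f} MiB")
```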