Allow size mis-match in autograd forward vs. backward


I have a slightly less conventional scenario involving my custom autograd function (M). Given a target of shape [*, N, *, *], the computational graph looks like this:

> INPUT -> NN -> M.forward (out shape: [*, C, *, *]) -> zero-pad channels from C to N -> Loss on all N channels

During the backward pass I want the gradients reaching module M to be of shape [*, N, *, *], even though its output has fewer channels. I know how to handle the extra padded dimensions in the gradient.

One option would be to pad inside the module M itself; however, I do not know the shape of the target beforehand.



Would you have a dummy code sample that shows what you’re trying to do by any chance (20 lines that we can run on Colab, for example)?
In particular, I am wondering what you expect to happen with the NN part of the backward: will it get gradients with a different shape than in the forward?

Thanks for your reply.

The NN part of the backward would be oblivious to what is going on outside its box :stuck_out_tongue:. I am not sure I can make a toy Colab example, but perhaps I can elaborate more on the computational graph. Denoting modules by capital letters and variables by lowercase letters:

input -> NN -> x -> M -> z (z has C channels)
(z, z_g) -> ZERO PADDING -> z_p (z_p has N channels, obtained by knowing the ground truth z_g)
(z_p, z_g) -> LOSS

Module M needs to send back gradients just w.r.t. x, but z is the variable I am using to measure how bad x is. In particular, I would need d Loss / dz to have N channels instead of C.
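To make this concrete, here is a minimal runnable sketch of that graph, with made-up shapes and a trivial placeholder (`x * 2`) standing in for M’s real computation. It shows that, as autograd works today, the gradient arriving in `M.backward()` already has only C channels, because the padding’s backward sliced it:

```python
import torch
import torch.nn.functional as F

# Hypothetical toy sizes: batch B, M outputs C channels, target has N channels.
B, C, N, H, W = 2, 3, 5, 4, 4

class M(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Placeholder for M's real computation; output keeps C channels.
        return x * 2.0

    @staticmethod
    def backward(ctx, grad_z):
        # grad_z arrives here with C channels (the padding's backward already
        # sliced it) -- the discussion is about wanting N channels instead.
        print("grad into M.backward:", grad_z.shape)
        return grad_z * 2.0

x = torch.randn(B, C, H, W, requires_grad=True)   # pretend this came from NN
z = M.apply(x)                                    # [B, C, H, W]
z_g = torch.randn(B, N, H, W)                     # ground truth with N channels
# Pad the channel dimension (dim 1) from C up to N with zeros.
z_p = F.pad(z, (0, 0, 0, 0, 0, N - C))            # [B, N, H, W]
loss = ((z_p - z_g) ** 2).mean()
loss.backward()
print(x.grad.shape)                               # [B, C, H, W]
```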


But from your schema, I would expect that the backward of the zero padding does exactly that. And so by the time you get to M, your grads only have C channels again.

No, but I would need grad_z to have N channels coming into backward() for it to compute grad_x.

To clarify further, this feature request actually breaks what I want to do:

Although I believe it is good to have these size checks in normal situations, I would be glad if there is some ‘hack’ to bypass the size-consistency checks. Thanks :slight_smile:

Well, all the formulas make that assumption, so that would most likely break anyway :confused:

> No, but I would need grad_z to have N channels coming into backward() for it to compute grad_x.

Then why not fold the padding into your custom function M (making z_g, or just N, an input to the custom Function)?
That way the custom Function will get the gradient of the right size and can handle it whichever way it wants.

I see. Then I think I have the following two options:

  1. As you suggested: making the zero padding an autograd Function, whose backward would actually implement the necessary computations for M.backward(). M.backward() would possibly not do much itself.

  2. Return from M an output z padded with ‘dummy’ channels so that its channel count exceeds N (assuming I know an upper bound on N). The zero-padding module is then not required, and I can just compute the loss on the valid channels of z. That would hurt me in terms of memory, but perhaps I can store the dummy channels through some sparse/empty-tensor hack :stuck_out_tongue:?
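Option 2 might look roughly like this. It is only a sketch: `N_MAX` is an assumed known upper bound, and `x * 2` is a placeholder for M’s real computation. The backward then receives a gradient with `N_MAX` channels and can interpret the extra ones however it needs to:

```python
import torch

N_MAX = 8  # assumed upper bound on the number of target channels N

class MPadded(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.C = x.shape[1]
        z = x * 2.0                                   # placeholder for M's real forward
        # Append dummy zero channels so the output always has N_MAX channels.
        pad = z.new_zeros((z.shape[0], N_MAX - ctx.C) + z.shape[2:])
        return torch.cat([z, pad], dim=1)             # [B, N_MAX, H, W]

    @staticmethod
    def backward(ctx, grad_z):
        # grad_z has N_MAX channels; the first N carry the real signal and the
        # rest are zeros. Here M can use all of them before producing a
        # C-channel gradient for x (this slice is just a placeholder rule).
        return grad_z[:, :ctx.C] * 2.0

x = torch.randn(2, 3, 4, 4, requires_grad=True)
z = MPadded.apply(x)                  # [2, N_MAX, 4, 4]
z_g = torch.randn(2, 5, 4, 4)         # this sample happens to have N = 5
loss = ((z[:, :5] - z_g) ** 2).mean() # loss only on the valid channels
loss.backward()
```

The memory cost comes from materializing the `N_MAX - C` zero channels for every sample, which is what the rest of this thread is about.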


I think that merging the two (M and the padding) will indeed be the best. That makes sure you don’t need any hack in between the two.

True, but the memory would be an issue. Is there any way to concatenate an ‘empty’ tensor to a valid tensor, inflating its number of channels while keeping the memory footprint similar? Thanks again :slight_smile:

> True, but the memory would be an issue.

I’m not sure I see why.
Currently, you already have x → M → z → PADDING → z_p.
I think you want (x, z_g) → M_AND_PADDING → z_p.

And in that new custom Function, you don’t need to do anything beyond what the padding is currently doing.
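A rough sketch of what that merged Function could look like, with `x * 2` as a placeholder for M’s actual computation. Because the padding now lives inside the Function, `backward()` receives the full N-channel gradient before producing the C-channel gradient for x:

```python
import torch

class MAndPadding(torch.autograd.Function):
    """Hypothetical merge of M and the zero padding into one Function."""

    @staticmethod
    def forward(ctx, x, z_g):
        ctx.C = x.shape[1]
        N = z_g.shape[1]
        z = x * 2.0                                   # placeholder for M's real forward
        pad = z.new_zeros((z.shape[0], N - ctx.C) + z.shape[2:])
        return torch.cat([z, pad], dim=1)             # z_p with N channels

    @staticmethod
    def backward(ctx, grad_zp):
        # grad_zp has N channels here -- M can use all of them however it
        # wants before returning a C-channel gradient for x.
        grad_x = grad_zp[:, :ctx.C] * 2.0             # placeholder rule
        return grad_x, None                           # no gradient for z_g

x = torch.randn(2, 3, 4, 4, requires_grad=True)
z_g = torch.randn(2, 5, 4, 4)
z_p = MAndPadding.apply(x, z_g)       # [2, 5, 4, 4]
((z_p - z_g) ** 2).mean().backward()
print(x.grad.shape)                   # torch.Size([2, 3, 4, 4])
```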

Yes, but making it generic would require knowing an upper bound on N. Although, in hindsight you are right: that upper bound might still become tight for some samples in the batch anyway.

Thanks for the advice :slight_smile: I will give it a try. Have a nice day!
