Allow size mis-match in autograd forward vs. backward


I have a slightly less conventional scenario involving my custom autograd function (M). Given a target of shape [*, N, *, *], the computational graph looks like this:

> INPUT -> NN -> M.forward (out shape: [*, C, *, *]) -> zero-pad channels from C to N -> Loss on all N channels

During the backward pass I want the gradients reaching module M to be of shape [*, N, *, *], even though its output has fewer channels. I know how to handle the extra padded dimensions in the gradient.

One option would be to pad inside the module M itself; however, I do not know the shape of the target beforehand.



Would you have a dummy code sample that shows what you’re trying to do by any chance (20 lines that we can run on Colab, for example)?
In particular, I am wondering what you expect to happen with the NN part of the backward: will it get gradients with a different shape than in the forward?

Thanks for your reply.

The NN part of the backward would be oblivious to what is going on outside its box :stuck_out_tongue:. I am not sure I can make a toy Colab example, but perhaps I can elaborate more on the computational graph. Denoting modules by capital letters and variables by lowercase letters:

input -> NN -> x -> M -> z (z has C channels)
(z, z_g) -> ZERO PADDING -> z_p (z_p has N channels, obtained by knowing the ground truth z_g)
(z_p, z_g) -> LOSS

Module M needs to send back gradients just w.r.t. x, but z is the variable I am using to measure how bad x is. In particular, I would need d Loss / dz to have N channels instead of C.
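To make this concrete, here is a minimal runnable sketch of that graph, with made-up shapes and a trivial placeholder (`x * 2`) standing in for M’s real computation. It shows that, as autograd works today, the gradient arriving in `M.backward()` already has only C channels, because the padding’s backward sliced it:

```python
import torch
import torch.nn.functional as F

# Hypothetical toy sizes: batch B, M outputs C channels, target has N channels.
B, C, N, H, W = 2, 3, 5, 4, 4

class M(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Placeholder for M's real computation; output keeps C channels.
        return x * 2.0

    @staticmethod
    def backward(ctx, grad_z):
        # grad_z arrives here with C channels (the padding's backward already
        # sliced it) -- the discussion is about wanting N channels instead.
        print("grad into M.backward:", grad_z.shape)
        return grad_z * 2.0

x = torch.randn(B, C, H, W, requires_grad=True)   # pretend this came from NN
z = M.apply(x)                                    # [B, C, H, W]
z_g = torch.randn(B, N, H, W)                     # ground truth with N channels
# Pad the channel dimension (dim 1) from C up to N with zeros.
z_p = F.pad(z, (0, 0, 0, 0, 0, N - C))            # [B, N, H, W]
loss = ((z_p - z_g) ** 2).mean()
loss.backward()
print(x.grad.shape)                               # [B, C, H, W]
```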


But from your schema, I would expect that the backward of the zero padding does exactly that. And so by the time you get to M, your grads only have C channels again.

No, but I would need grad_z to have N channels coming into backward() for it to compute grad_x.

To clarify further, this feature request actually breaks what I want to do:

Although I believe it is good to have these size checks in normal situations, I would be glad if there is some ‘hack’ to bypass the size-consistency checks. Thanks :slight_smile:

Well, all the formulas make that assumption, so that would most likely break anyway :confused:

> No, but I would need grad_z to have N channels coming into backward() for it to compute grad_x.

Then why not fold the padding into your custom function M (making z_g, or just N, an input to the custom Function)?
That way the custom Function will get the gradient of the right size and can handle it whichever way it wants.

I see. Then I think I have the following two options:

  1. As you suggested: making the zero padding an autograd Function, whose backward would actually implement the necessary computations for M.backward(). M.backward() would possibly not do much itself.

  2. Return from M an output z padded with ‘dummy’ channels so that its channel count exceeds N (assuming I know an upper bound on N). The zero-padding module is then not required, and I can just compute the loss on the valid channels of z. That would hurt me in terms of memory, but perhaps I can store the dummy channels through some sparse/empty-tensor hack :stuck_out_tongue:?
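Option 2 might look roughly like this. It is only a sketch: `N_MAX` is an assumed known upper bound, and `x * 2` is a placeholder for M’s real computation. The backward then receives a gradient with `N_MAX` channels and can interpret the extra ones however it needs to:

```python
import torch

N_MAX = 8  # assumed upper bound on the number of target channels N

class MPadded(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.C = x.shape[1]
        z = x * 2.0                                   # placeholder for M's real forward
        # Append dummy zero channels so the output always has N_MAX channels.
        pad = z.new_zeros((z.shape[0], N_MAX - ctx.C) + z.shape[2:])
        return torch.cat([z, pad], dim=1)             # [B, N_MAX, H, W]

    @staticmethod
    def backward(ctx, grad_z):
        # grad_z has N_MAX channels; the first N carry the real signal and the
        # rest are zeros. Here M can use all of them before producing a
        # C-channel gradient for x (this slice is just a placeholder rule).
        return grad_z[:, :ctx.C] * 2.0

x = torch.randn(2, 3, 4, 4, requires_grad=True)
z = MPadded.apply(x)                  # [2, N_MAX, 4, 4]
z_g = torch.randn(2, 5, 4, 4)         # this sample happens to have N = 5
loss = ((z[:, :5] - z_g) ** 2).mean() # loss only on the valid channels
loss.backward()
```

The memory cost comes from materializing the `N_MAX - C` zero channels for every sample, which is what the rest of this thread is about.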


I think that merging the two (M and the padding) will indeed be the best. That makes sure you don’t need any hack in between the two.

True, but the memory would be an issue. Is there any way to concatenate an ‘empty’ tensor to a valid tensor, inflating its number of channels while keeping the memory footprint similar? Thanks again :slight_smile:

> True, but the memory would be an issue.

I’m not sure I see why.
Currently, you already have x → M → z → PADDING → z_p.
I think you want (x, z_g) → M_AND_PADDING → z_p.

And in that new custom Function, you don’t need to do anything beyond what the padding is currently doing.
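A rough sketch of what that merged Function could look like, with `x * 2` as a placeholder for M’s actual computation. Because the padding now lives inside the Function, `backward()` receives the full N-channel gradient before producing the C-channel gradient for x:

```python
import torch

class MAndPadding(torch.autograd.Function):
    """Hypothetical merge of M and the zero padding into one Function."""

    @staticmethod
    def forward(ctx, x, z_g):
        ctx.C = x.shape[1]
        N = z_g.shape[1]
        z = x * 2.0                                   # placeholder for M's real forward
        pad = z.new_zeros((z.shape[0], N - ctx.C) + z.shape[2:])
        return torch.cat([z, pad], dim=1)             # z_p with N channels

    @staticmethod
    def backward(ctx, grad_zp):
        # grad_zp has N channels here -- M can use all of them however it
        # wants before returning a C-channel gradient for x.
        grad_x = grad_zp[:, :ctx.C] * 2.0             # placeholder rule
        return grad_x, None                           # no gradient for z_g

x = torch.randn(2, 3, 4, 4, requires_grad=True)
z_g = torch.randn(2, 5, 4, 4)
z_p = MAndPadding.apply(x, z_g)       # [2, 5, 4, 4]
((z_p - z_g) ** 2).mean().backward()
print(x.grad.shape)                   # torch.Size([2, 3, 4, 4])
```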

Yes, but making it generic would require knowing an upper bound on N. Although, in hindsight you are right: that upper bound might still become tight for some samples in the batch anyway.

Thanks for the advice :slight_smile: I will give it a try. Have a nice day!
