Allow size mis-match in autograd forward vs. backward


I have a slightly less conventional scenario involving my custom autograd function (M). Given a target of shape [*, N, *, *], the computational graph looks like this:

> INPUT -> NN -> M.forward (out shape: [*, C, *, *]) -> zero-pad channels from C to N -> Loss on all N channels

During the backward pass I want the gradients reaching module M to be of shape [*, N, *, *], even though its output has fewer channels. I know how to handle the extra padded dimensions in the gradient.

One option would be to pad inside the module M itself; however, I do not know the shape of the target beforehand.



Would you have a dummy code sample that shows what you’re trying to do by any chance (20 lines that we can run on Colab, for example)?
In particular, I am wondering what you expect to happen with the NN part of the backward: will it get gradients with a different shape than in the forward?

Thanks for your reply.

The NN part of the backward would be oblivious to what is going on outside its box :stuck_out_tongue:. I am not sure I can make a toy Colab example, but perhaps I can elaborate more on the computational graph. Denoting modules by capital letters and variables by lowercase letters:

input -> NN -> x -> M -> z (z has C channels)
(z, z_g) -> ZERO PADDING -> z_p (z_p has N channels, obtained by knowing the ground truth z_g)
(z_p, z_g) -> LOSS

Module M needs to send back gradients just w.r.t. x, but z is the variable I am using to measure how bad x is. In particular, I would need d Loss / dz to have N channels instead of C.
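To make this concrete, here is a minimal runnable sketch of that graph, with made-up shapes and a trivial placeholder (`x * 2`) standing in for M’s real computation. It shows that, as autograd works today, the gradient arriving in `M.backward()` already has only C channels, because the padding’s backward sliced it:

```python
import torch
import torch.nn.functional as F

# Hypothetical toy sizes: batch B, M outputs C channels, target has N channels.
B, C, N, H, W = 2, 3, 5, 4, 4

class M(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Placeholder for M's real computation; output keeps C channels.
        return x * 2.0

    @staticmethod
    def backward(ctx, grad_z):
        # grad_z arrives here with C channels (the padding's backward already
        # sliced it) -- the discussion is about wanting N channels instead.
        print("grad into M.backward:", grad_z.shape)
        return grad_z * 2.0

x = torch.randn(B, C, H, W, requires_grad=True)   # pretend this came from NN
z = M.apply(x)                                    # [B, C, H, W]
z_g = torch.randn(B, N, H, W)                     # ground truth with N channels
# Pad the channel dimension (dim 1) from C up to N with zeros.
z_p = F.pad(z, (0, 0, 0, 0, 0, N - C))            # [B, N, H, W]
loss = ((z_p - z_g) ** 2).mean()
loss.backward()
print(x.grad.shape)                               # [B, C, H, W]
```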


But from your schema, I would expect that the backward of the zero padding does exactly that. And so by the time you get to M, your grads only have C channels again.

No, but I would need grad_z to have N channels coming into backward() for it to compute grad_x.

To clarify further, this feature request actually breaks what I want to do:

Although I believe it is good to have these size checks in normal situations, I would be glad if there is some ‘hack’ to bypass the size-consistency checks. Thanks :slight_smile:

Well, all the formulas make that assumption, so that would most likely break anyway :confused:

> No, but I would need grad_z to have N channels coming into backward() for it to compute grad_x.

Then why not fold the padding into your custom function M (making z_g, or just N, an input to the custom Function)?
That way the custom Function will get the gradient of the right size and can handle it whichever way it wants.

I see. Then I think I have the following two options:

  1. As you suggested: making the zero padding an autograd Function, whose backward would actually implement the necessary computations for M.backward(). M.backward() would possibly not do much itself.

  2. Return from M an output z padded with ‘dummy’ channels so that its channel count exceeds N (assuming I know an upper bound on N). The zero-padding module is then not required, and I can just compute the loss on the valid channels of z. That would hurt me in terms of memory, but perhaps I can store the dummy channels through some sparse/empty-tensor hack :stuck_out_tongue:?
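Option 2 might look roughly like this. It is only a sketch: `N_MAX` is an assumed known upper bound, and `x * 2` is a placeholder for M’s real computation. The backward then receives a gradient with `N_MAX` channels and can interpret the extra ones however it needs to:

```python
import torch

N_MAX = 8  # assumed upper bound on the number of target channels N

class MPadded(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.C = x.shape[1]
        z = x * 2.0                                   # placeholder for M's real forward
        # Append dummy zero channels so the output always has N_MAX channels.
        pad = z.new_zeros((z.shape[0], N_MAX - ctx.C) + z.shape[2:])
        return torch.cat([z, pad], dim=1)             # [B, N_MAX, H, W]

    @staticmethod
    def backward(ctx, grad_z):
        # grad_z has N_MAX channels; the first N carry the real signal and the
        # rest are zeros. Here M can use all of them before producing a
        # C-channel gradient for x (this slice is just a placeholder rule).
        return grad_z[:, :ctx.C] * 2.0

x = torch.randn(2, 3, 4, 4, requires_grad=True)
z = MPadded.apply(x)                  # [2, N_MAX, 4, 4]
z_g = torch.randn(2, 5, 4, 4)         # this sample happens to have N = 5
loss = ((z[:, :5] - z_g) ** 2).mean() # loss only on the valid channels
loss.backward()
```

The memory cost comes from materializing the `N_MAX - C` zero channels for every sample, which is what the rest of this thread is about.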


I think that merging the two (M and the padding) will indeed be the best. That makes sure you don’t need any hack in between the two.

True, but the memory would be an issue. Is there any way to concatenate an ‘empty’ tensor to a valid tensor, inflating its number of channels while keeping the memory footprint similar? Thanks again :slight_smile:

> True, but the memory would be an issue.

I’m not sure I see why.
Currently, you already have x → M → z → PADDING → z_p.
I think you want (x, z_g) → M_AND_PADDING → z_p.

And in that new custom Function, you don’t need to do anything beyond what the padding is currently doing.
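A rough sketch of what that merged Function could look like, with `x * 2` as a placeholder for M’s actual computation. Because the padding now lives inside the Function, `backward()` receives the full N-channel gradient before producing the C-channel gradient for x:

```python
import torch

class MAndPadding(torch.autograd.Function):
    """Hypothetical merge of M and the zero padding into one Function."""

    @staticmethod
    def forward(ctx, x, z_g):
        ctx.C = x.shape[1]
        N = z_g.shape[1]
        z = x * 2.0                                   # placeholder for M's real forward
        pad = z.new_zeros((z.shape[0], N - ctx.C) + z.shape[2:])
        return torch.cat([z, pad], dim=1)             # z_p with N channels

    @staticmethod
    def backward(ctx, grad_zp):
        # grad_zp has N channels here -- M can use all of them however it
        # wants before returning a C-channel gradient for x.
        grad_x = grad_zp[:, :ctx.C] * 2.0             # placeholder rule
        return grad_x, None                           # no gradient for z_g

x = torch.randn(2, 3, 4, 4, requires_grad=True)
z_g = torch.randn(2, 5, 4, 4)
z_p = MAndPadding.apply(x, z_g)       # [2, 5, 4, 4]
((z_p - z_g) ** 2).mean().backward()
print(x.grad.shape)                   # torch.Size([2, 3, 4, 4])
```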

Yes, but making it generic would require knowing an upper bound on N. Although, in hindsight you are right: that upper bound might still become tight for some samples in the batch anyway.

Thanks for the advice :slight_smile: I will give it a try. Have a nice day!
