Hi, so if I want to learn a matrix M that will transform an image I into I’ that is ideally very close to the ground-truth image I_g, the forward pass is the following:

M = [ scale    0    dx ]
    [   0   scale   dy ]

The only variables in M are scale, dx, dy

I*M = I’
loss = some distance between I_g and I’ (e.g. MSE)

note that this * is actually implemented with affine_grid and grid_sample, but it doesn’t matter here

When I backprop the loss, though, the gradient will first flow to M, and then to dx, dy, scale.
I don’t want the optimizer to treat the entire matrix M as a parameter to update, because, for example, the value in the first row, second column is a fixed 0. At the moment, though, it’s treating that 0 as updatable, and I don’t want that. However, if I call M.detach(), then dx, dy, and scale won’t be updated/learned…
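A minimal sketch of the usual fix: keep only scale, dx, dy as leaf parameters and rebuild the 2×3 matrix from them on every forward pass, so the zero entries are fresh constants that never get updated. The image shapes and MSE loss here are just placeholders for the demo.

```python
import torch
import torch.nn.functional as F

# Only these three scalars are learnable leaves.
scale = torch.tensor(1.0, requires_grad=True)
dx = torch.tensor(0.0, requires_grad=True)
dy = torch.tensor(0.0, requires_grad=True)

def build_theta(scale, dx, dy):
    # Assemble the 2x3 affine matrix from the scalars; the zeros
    # are recreated each call, so they are constants, not parameters.
    zero = torch.zeros_like(scale)
    row0 = torch.stack([scale, zero, dx])
    row1 = torch.stack([zero, scale, dy])
    return torch.stack([row0, row1]).unsqueeze(0)  # shape (1, 2, 3)

I = torch.rand(1, 1, 8, 8)    # dummy input image (N, C, H, W)
I_g = torch.rand(1, 1, 8, 8)  # dummy ground-truth image

theta = build_theta(scale, dx, dy)
grid = F.affine_grid(theta, I.shape, align_corners=False)
I_prime = F.grid_sample(I, grid, align_corners=False)

loss = F.mse_loss(I_prime, I_g)
loss.backward()

# Gradients land only on the three scalars, not on a full matrix.
print(scale.grad, dx.grad, dy.grad)
```

You would then pass `[scale, dx, dy]` to the optimizer, and no gradient ever touches the fixed zeros.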

You’re right, I just realized that the gradients at the a and e positions sum together to give the scale gradient. Thanks! But since I’m here, I wanted to ask a related question: why is it that when the gradients are largest, the values are affected the least? When I plot my gradients of x, they oscillate between super big and super small, yet even when they are large the value of x barely changes at all… I’m using an Adam optimizer and was wondering if that’s why? I’m not sure how Adam affects it, it’s just a guess…

When you say the value doesn’t change at all, do you mean after the optimizer step?
If so, yes, that can be due to Adam: it divides each update by a running estimate of the gradient magnitude, so the effective step size stays roughly at the learning rate regardless of how large the raw gradient is, and it heavily smooths the trajectory when the gradients are very noisy. You can try SGD to see if you get the same behavior.