Hello everyone,

I have a *nooby* question about gradient computation. Let's say I have an input matrix `X` with shape `[T, T]`. I want to process this matrix in a block-diagonal way, where each block has shape `[B, B]` with `B << T`.
For the first implementation, and to check whether my logic makes sense, I went with a simple for loop that in each step modifies only one block of `X` (adding a dependency factor) and nothing else. The `grad_fn` attribute shows `(<CopySlices object at 0x7f76c424a070>, 0)`, so I assume that backpropagation is working as intended.
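Roughly, the loop version looks like this (a toy sketch; `compute_factor` is just a placeholder for my actual dependency computation):

```python
import torch

T, B = 8, 2  # toy sizes; in my real code B << T
X = torch.randn(T, T, requires_grad=True)

def compute_factor(block):
    # placeholder for my actual dependency computation
    return 0.1 * block

out = X.clone()
for i in range(0, T, B):
    blk = out[i:i + B, i:i + B]
    out[i:i + B, i:i + B] = blk + compute_factor(blk)

print(out.grad_fn)  # <CopySlices object at 0x...>
```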
This was extremely slow, so for the second implementation I went with a mask approach. The mask is known for a given `X`, so I precompute it. I compute the dependency factor for every item in the `[T, T]` matrix, but keep only the masked factors (to be added block-diagonally to `X`). Finally, I compute `X = X + (mask * factor)`. This approach gives `(<MulBackward0 object at 0x7f76c4283be0>, 0)`.
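In code, the mask version is roughly this (again a toy sketch with a placeholder factor):

```python
import torch

T, B = 8, 2
X = torch.randn(T, T, requires_grad=True)

# precompute a block-diagonal 0/1 mask once per X
mask = torch.block_diag(*[torch.ones(B, B) for _ in range(T // B)])

factor = 0.1 * X            # placeholder dependency factor for every entry
out = X + mask * factor     # off-block entries receive mask * factor == 0

print(out.grad_fn)  # <AddBackward0 ...>; the mask multiply appears as MulBackward0 upstream
```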

My question is: the two results I get from these implementations are slightly different (max absolute error `2e-5`). Is that due to numerical reasons, or is it because in the second approach I touch every item of the `[T, T]` matrix, even though outside the blocks the dependency factor I add is zero?

Thanks in advance for any help and insight you can provide!

Hi George!

I believe that your discrepancy is due to (accumulated) round-off
error.

An easy way to test this is to repeat your computation using double
precision (by using `torch.double` tensors). If the discrepancy is due
to round-off error it should drop dramatically, say to about 1e-13. If the
discrepancy were to remain around 2e-5, then it would be due to the
two computations not being mathematically equivalent (or to some bug
somewhere).

From your description, your two computations do sound like they should
be mathematically equivalent (but without full details, I can't really say
for sure).

As an aside, do you ever work with the off-block-diagonal parts of your
tensor `X`? If not, you might consider creating, storing, and processing
only the `[B, B]`-shaped blocks themselves. Given that `B << T`, doing
so could result in significant time and memory savings.
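
For example, you could keep just the diagonal blocks in a single batched
`[T // B, B, B]` tensor (a sketch, assuming `B` divides `T` evenly) and
rebuild the full matrix with `torch.block_diag()` only when you need it:

```python
import torch

T, B = 64, 4
num_blocks = T // B  # assuming B divides T evenly

blocks = torch.randn(num_blocks, B, B, requires_grad=True)

# process all blocks at once, e.g. add some per-block dependency factor
blocks_out = blocks + 0.1 * blocks

# only if you ever need the full [T, T] matrix back:
X_full = torch.block_diag(*blocks_out.unbind(0))
print(X_full.shape)  # torch.Size([64, 64])
```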

Best.

K. Frank

Hi Frank,
Firstly, thank you for your answer. You confirmed for me that it was indeed a bug and not something I didn't understand about autograd!

> An easy way to test this is to repeat your computation using double
> precision (by using `torch.double` tensors). If the discrepancy is due
> to round-off error it should drop dramatically, say to about 1e-13. If the
> discrepancy were to remain around 2e-5, then it would be due to the
> two computations not being mathematically equivalent (or to some bug
> somewhere).

As I said above, the behavior I documented was due to a nasty little bug. In my attempts to make the code faster I found an error in my logic (broadcasting is to blame… or rather, I am). After I fixed that, the error dropped to `1e-10`, even without double tensors!
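For anyone who hits something similar, this is the kind of broadcasting slip I mean (a toy example, not my actual code): PyTorch silently broadcasts mismatched shapes instead of raising an error.

```python
import torch

col = torch.arange(3.).view(3, 1)  # shape [3, 1]
row = torch.arange(3.)             # shape [3]

# I expected an elementwise op on matching shapes, but this silently
# broadcasts to [3, 3] -- no error, just wrong values downstream.
print((col + row).shape)  # torch.Size([3, 3])
```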

> As an aside, do you ever work with the off-block-diagonal parts of your
> tensor `X`? If not, you might consider creating, storing, and processing
> only the `[B, B]`-shaped blocks themselves. Given that `B << T`, doing
> so could result in significant time and memory savings.

At least for now, I need the whole `[T, T]` matrix, with only a bit of processing in those diagonal blocks. If the logic of the architecture changes, I would definitely do that to improve my memory usage. The time savings are already more than enough after getting rid of that slow for loop!

P.S. Thanks again for your help. I'll mark your answer as the solution, because it was indeed a bug!