Hello everyone,

I have a *nooby* question about gradient computation. Let's say I have an input matrix `X` with shape `[T, T]`. I want to process this matrix in a block-diagonal way, where each block has shape `[B, B]` with `B << T`.
For the first implementation, and to check whether my logic makes sense, I went with a simple for loop that in each step modifies only one block of `X` (adding a dependency factor) and nothing else. The `grad_fn` attribute shows `(<CopySlices object at 0x7f76c424a070>, 0)`, so I assume that backpropagation is working as intended.
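Roughly, the loop version looks like this (a toy sketch; `compute_factor` is just a placeholder for my actual dependency computation):

```python
import torch

T, B = 8, 2  # toy sizes; in my real code B << T
X = torch.randn(T, T, requires_grad=True)

def compute_factor(block):
    # placeholder for my actual dependency computation
    return 0.1 * block

out = X.clone()
for i in range(0, T, B):
    blk = out[i:i + B, i:i + B]
    out[i:i + B, i:i + B] = blk + compute_factor(blk)

print(out.grad_fn)  # <CopySlices object at 0x...>
```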
This was extremely slow, so for the second implementation I went with a mask approach. The mask is known for a given `X`, so I precompute it. I compute the dependency factor for every item in the `[T, T]` matrix, but keep only the masked factors (to be added block-diagonally to `X`). Finally, I compute `X = X + (mask * factor)`. This approach gives `(<MulBackward0 object at 0x7f76c4283be0>, 0)`.
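In code, the mask version is roughly this (again a toy sketch with a placeholder factor):

```python
import torch

T, B = 8, 2
X = torch.randn(T, T, requires_grad=True)

# precompute a block-diagonal 0/1 mask once per X
mask = torch.block_diag(*[torch.ones(B, B) for _ in range(T // B)])

factor = 0.1 * X            # placeholder dependency factor for every entry
out = X + mask * factor     # off-block entries receive mask * factor == 0

print(out.grad_fn)  # <AddBackward0 ...>; the mask multiply appears as MulBackward0 upstream
```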

My question is: the two results I get from these implementations are slightly different (max absolute error `2e-5`). Is that due to numerical reasons, or is it because in the second approach I touch every item of the `[T, T]` matrix, even though outside the blocks the dependency factor I add is zero?

Thanks in advance for any help and insight you can provide!

Hi George!

I believe that your discrepancy is due to (accumulated) round-off
error.

An easy way to test this is to repeat your computation using double
precision (by using `torch.double` tensors). If the discrepancy is due
to round-off error it should drop dramatically, say to about 1e-13. If the
discrepancy were to remain around 2e-5, then it would be due to the
two computations not being mathematically equivalent (or to some bug
somewhere).

From your description, your two computations do sound like they should
be mathematically equivalent (but without full details, I can't really say
for sure).

As an aside, do you ever work with the off-block-diagonal parts of your
tensor `X`? If not, you might consider creating, storing, and processing
only the `[B, B]`-shaped blocks themselves. Given that `B << T`, doing
so could result in significant time and memory savings.
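
For example, you could keep just the diagonal blocks in a single batched
`[T // B, B, B]` tensor (a sketch, assuming `B` divides `T` evenly) and
rebuild the full matrix with `torch.block_diag()` only when you need it:

```python
import torch

T, B = 64, 4
num_blocks = T // B  # assuming B divides T evenly

blocks = torch.randn(num_blocks, B, B, requires_grad=True)

# process all blocks at once, e.g. add some per-block dependency factor
blocks_out = blocks + 0.1 * blocks

# only if you ever need the full [T, T] matrix back:
X_full = torch.block_diag(*blocks_out.unbind(0))
print(X_full.shape)  # torch.Size([64, 64])
```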

Best.

K. Frank

Hi Frank,
Firstly, thank you for your answer. You confirmed for me that it was indeed a bug and not something I didn't understand about autograd!

> An easy way to test this is to repeat your computation using double
> precision (by using `torch.double` tensors). If the discrepancy is due
> to round-off error it should drop dramatically, say to about 1e-13. If the
> discrepancy were to remain around 2e-5, then it would be due to the
> two computations not being mathematically equivalent (or to some bug
> somewhere).

As I said above, the behavior I documented was due to a nasty little bug. In my attempts to make the code faster I found an error in my logic (broadcasting is to blame… or rather, I am). After I fixed that, the error dropped to `1e-10`, even without double tensors!
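For anyone who hits something similar, this is the kind of broadcasting slip I mean (a toy example, not my actual code): PyTorch silently broadcasts mismatched shapes instead of raising an error.

```python
import torch

col = torch.arange(3.).view(3, 1)  # shape [3, 1]
row = torch.arange(3.)             # shape [3]

# I expected an elementwise op on matching shapes, but this silently
# broadcasts to [3, 3] -- no error, just wrong values downstream.
print((col + row).shape)  # torch.Size([3, 3])
```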

> As an aside, do you ever work with the off-block-diagonal parts of your
> tensor `X`? If not, you might consider creating, storing, and processing
> only the `[B, B]`-shaped blocks themselves. Given that `B << T`, doing
> so could result in significant time and memory savings.

At least for now, I need the whole `[T, T]` matrix, with only a bit of processing in those diagonal blocks. If the logic of the architecture changes, I would definitely do that to improve my memory usage. The time savings are already more than enough after getting rid of that slow for loop!

P.S. Thanks again for your help. I'll mark your answer as the solution, because it was indeed a bug!