Associativity during backprop


Before trying to reinvent the wheel for my model I would like to check if something like the following
is already implemented.
Does the current implementation tries to perform the most optimal set of tensor product
based on the size of parent/children tensors?
A simple example is a product between the following matrices A * B* C = (100 x 100) * (100 x 10) * (10 x 1).
If I’m correct this should be a relatively common operation, especially during the backward-pass
due to the chain rule.
Performing (A*B) * C should result in 200900 flops, while A * (B * C) only 21800.


No we don’t actually try to do that.
The main reason is that we almost never actually do such vector matrix product.
The backward pass implementation is given v and computes v^T Jf with Jf being the Jacobian of the current function f.
So we don’t actually need to materialize Jf ever.

For example, if you consider the relu operation for a 1D input of size n. The Jacobian would be an n^2 matrix.
But what the backward pass does is just mask out some entries in v as: v[inp<0] = 0.

For example, if you consider the relu operation for a 1D input of size n. The Jacobian would be an n^2 matrix.
But what the backward pass does is just mask out some entries in v as: v[inp<0] = 0 .

I see, but what if you have a convolution? Correct me if I’m wrong, but at that point, you would have to convolve the convolution output with the respective input region for taking the gradient w.r.t the weights right?

What if the weights are functions say of a tensor product of a smaller amount of parameters (Imagine a separable 2d convolution, where I decide to compute the kernel using the tensor product)? At that point, pytorch would first compute the gradient for each weight of the full 2d kernel, and then the result would be multiplied by the gradient of the tensor, which is similar to the (A * B) * C operation above, right?

Anyways, thank you for your time answering my question. If I see that the method above applies to my situation I definitely implement it.


Yes in theory it does, but in practice, since none of these matrices are actually materialized, you might not get the speedup you would think by just looking at the matrix products.
In particular the convolution backward is a transposed convolution which might be tricky to adapt to do what you want to do

What do you mean by none of these matrices are materialized?

I mean that the Jacobian matrix is never actually created. And even though this matrix is n x n in size, we can do the dot product with a vector of size n in time linear in n instead of n^2 (using the sparsity of the matrix).
But the logic you talk about above about optimizing matrix products is based on the idea that regular matrix product is n^2 complexity.

I am not saying you cannot do it, but you might not see the benefits you expect.

I see, thank you for your answer and advice!

At this point is just out of curiosity.
Even though torch doesn’t materialize those matrices, this amount of operations needs to be done one way or another. I guess what you mean, is that this matrix doesn’t appear as a contiguous array in memory, is that right? If not, I don’t know what would be the alternative. I would love to know more details if it doesn’t waste too much of your time on this specific matter.

If the matrix is sparse, you don’t actually have to :smiley:

For example if you have a vector v of size n and a matrix A of size nxn.
If you know that A is diagonal. You can perform the vector matrix product by just storing a the diagonal of a. And performing an element-wise product of the n elements of v and a.
So you get the same results as v^T A but by doing n operations instead of n^2.

Does that make sense?

Yes, it makes sense, but I don’t understand the assumption of why the matrix would be sparse most of the cases. For old-fashioned convolutions/models/datasets, sparsity seems not so common if not “model induced”.
So I guess that’s the reason why I couldn’t understand your advice :slight_smile: .
But maybe you are telling me that most of the feature maps are indeed sparse, and under the hood pytorch does lots of checks in this regard. So apart the first layer, everything else is sparsified.

Actually a convolution is very very sparse.
In the setting where you consider the input as 1D and output as 1D where you just collapse all the dimensions, The Jacobian would be huge (or even writing the convolution as a mm with 1D inputs).
But the weights are both re-used and applied only to subsets of the channels. Leading to a significant sparsity in that operation.

I see what you mean. Thank you for your time!