Different size for tensor grad compared to the tensor itself

Mahdi_Zaferanchi · March 26, 2024, 11:24am

I’m implementing an algorithm with a custom autograd.Function. In this algorithm, some variables have gradients with a different size from the tensor itself. I understand this is unusual, but it makes sense for my use case. But this leads to the following error:

RuntimeError: attempting to assign a gradient of size ‘…’ to a tensor of size ‘…’. Please ensure that the gradient and the tensor are the same size

Is there any way to bypass this check?

KFrank · March 26, 2024, 9:55pm

Hi Mahdi!

Indeed.

Could you explain your use case? What you propose is quite odd.

This is certainly what one would expect mathematically. The gradient with
respect to a tensor is the set of the partial derivatives of a scalar function of
that tensor with respect to the elements of that tensor. So the elements of
the gradient should be in one-to-one correspondence with the elements of
the tensor.

Pytorch normally uses such gradients to perform optimization steps on the
tensors that are the weights of a model being trained. The simplest case
(namely plain-vanilla SGD) subtracts the learning rate times the gradient
from the weight tensor itself – an operation that requires the gradient and
tensor to have the same size.
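
A minimal sketch of the plain SGD step described above, assuming a made-up weight tensor `w`, a toy loss, and an arbitrary learning rate `lr` (none of these come from the thread). It only illustrates why the update needs the gradient and the tensor to have the same size:

```python
import torch

# Hypothetical weight tensor and toy scalar loss, for illustration only.
w = torch.randn(10, requires_grad=True)
loss = (w ** 2).sum()
loss.backward()

# autograd produces one partial derivative per element of w.
assert w.grad.shape == w.shape

lr = 0.1  # assumed learning rate
w_before = w.detach().clone()
with torch.no_grad():
    w -= lr * w.grad  # plain SGD step: element-wise, so shapes must match

# A "gradient" of a different size cannot take part in the same update.
bad_grad = torch.zeros(320)
try:
    with torch.no_grad():
        w - lr * bad_grad  # shapes (10,) and (320,) do not broadcast
    shapes_compatible = True
except RuntimeError:
    shapes_compatible = False
```

Here `shapes_compatible` ends up `False`, because tensors of shape (10,) and (320,) cannot be combined element-wise.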

So what is your use case and how are you hoping it might fit into pytorch’s
overall framework?

Best.

K. Frank

Mahdi_Zaferanchi · March 27, 2024, 8:30am

Thanks for the reply.

My algorithm uses extreme quantization. So in the forward pass, each bit of a tensor element is its own “unit” and needs a separate gradient. For example, an intermediate representation of shape (10,) with float32 data actually represents 320 independent bits and not 10 float numbers. But each bit still has a float gradient so I need .grad to be of shape (320,). (Batch size is one in this example.)
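
To make the arithmetic concrete, here is a small standard-library sketch of how 10 float32 values correspond to 320 individual bits. The sample values and the bit order (least significant bit first within each byte) are arbitrary assumptions for illustration, not the layout used by the actual algorithm:

```python
import struct

# Ten made-up float values, for illustration only.
values = [0.5, -1.0, 3.25, 0.0, 1.0, 2.0, -0.125, 7.0, 100.0, 1e-3]

# Pack as little-endian float32: 4 bytes (32 bits) per value.
raw = struct.pack("<10f", *values)

# Unpack every byte into its 8 bits (LSB first; an arbitrary choice).
bits = [(byte >> k) & 1 for byte in raw for k in range(8)]

# One float gradient slot per bit, so 320 slots for 10 float32 values.
bit_grads = [0.0] * len(bits)

print(len(values), len(raw), len(bits))
```

The gradient buffer `bit_grads` has 32 times as many entries as `values`, which is exactly the size mismatch that `.grad` refuses to accept.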

My custom C++ and CUDA code takes care of handling the forward pass with this unusual setup. This is much faster and more memory-efficient than using Boolean data for my use case.

If I can’t set .grad to a tensor of the size that I need, the only solution I can think of is to initialize a separate variable of the correct size just so it can hold the gradients. This will probably work, but the memory allocated for the tensor is wasted since I’m only using its .grad and not .data.

What I’m doing is inspired by this paper but the training and backward pass is handled differently there.

I had a similar problem because pytorch doesn’t allow integer tensors to have gradients. I changed my C++ functions to expect floats which I cast to integers that can be shifted and manipulated to reveal each individual bit and handle binary operations (AND, OR, etc.). But that effort is in vain now if I can’t work around this shape limitation.
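
The C++ code itself is not shown in this thread. As a rough Python-side analogue of the same idea (an assumption about the general approach, not the actual implementation), a float32 tensor can be reinterpreted as int32 with `Tensor.view(dtype)` and then inspected with shifts and masks:

```python
import torch

# Two made-up float32 values, for illustration only.
x = torch.tensor([1.0, -2.0], dtype=torch.float32)

# Reinterpret the same 32 bits per element as int32 (no numeric conversion).
xi = x.view(torch.int32)

# Shift and mask to expose individual bit fields of the IEEE 754 layout.
sign = (xi >> 31) & 1          # bit 31
exponent = (xi >> 23) & 0xFF   # bits 23..30

# Viewing back as float32 recovers the original values bit for bit.
x_back = xi.view(torch.float32)
```

The float tensor can carry `requires_grad`, while the integer view is used only for the bit-level operations.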

You might be thinking that bits can’t have gradients because they don’t change smoothly. I can’t fully explain everything here, but suffice it to say that (I think) I know what I’m doing, and the algorithm has already been validated in Python (in a very inefficient way, by using a float for each bit).

KFrank · March 27, 2024, 6:02pm

Hi Mahdi!

Why not then just allocate a single tensor of the appropriate size for your
(weird) gradient? The elements of .grad and .data won’t match up
one-to-one anyway, so what do you gain by having your gradient tensor
be attached to your “data” tensor as its .grad property?

Yes, that is exactly what I am thinking …

You haven’t told us what you will be using your “gradient” for. I can hardly
imagine that you will be able to hook it into pytorch’s autograd framework
in any sensible way. So it would seem like creating a separate “gradient”
tensor – not attached as a .grad property to anything else – would be the
way to go.
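
A minimal sketch of that suggestion, assuming a hypothetical `PackedBits` container and a placeholder incoming gradient (neither appears in the thread). The per-bit gradient lives in its own tensor next to the packed data instead of being attached as `.grad`:

```python
from dataclasses import dataclass

import torch

BITS_PER_FLOAT32 = 32


@dataclass
class PackedBits:
    """Hypothetical container: packed storage plus a per-bit gradient buffer."""
    data: torch.Tensor      # shape (n,), float32 whose bits are the payload
    bit_grad: torch.Tensor  # shape (n * 32,), one float gradient per bit


def make_packed(n: int) -> PackedBits:
    data = torch.zeros(n, dtype=torch.float32)
    bit_grad = torch.zeros(n * BITS_PER_FLOAT32, dtype=torch.float32)
    return PackedBits(data=data, bit_grad=bit_grad)


state = make_packed(10)

# A custom backward step could accumulate into the side buffer
# rather than handing autograd a gradient of a mismatched shape.
incoming = torch.ones(320)  # placeholder per-bit gradient
state.bit_grad += incoming
```

This keeps the two sizes independent, but it does not by itself connect the buffer to the autograd engine; whatever fills `bit_grad` has to do so explicitly.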

Best.

K. Frank

Mahdi_Zaferanchi · March 29, 2024, 8:08am

That would solve my problem, at least in terms of not wasting memory.

I am using the autograd engine. That is why I need to store the gradients in a .grad.

It seems there really isn’t a straightforward solution for me, and my current workaround might be the best I can get, apart perhaps from compiling pytorch from source and removing the shape mismatch check, which I’m not even sure would work. Maybe I’ll try that later, when I’m more confident about my algorithm and am optimizing for performance.

Thank you for your help.