I am working on PEFT techniques for LLMs. An idea I wanted to try out was full finetuning while only propagating the maximum gradient from any particular node. That is, given a node N in the computation graph on which N.backward() has been called, first compute the gradients for the children of N, and then pass the gradient to, and call backward() on, only the child with the largest gradient (taking the L1 norm in the case of non-scalar parameters). This would significantly reduce the overhead of the backward() pass.
However, this is proving to be very difficult because pytorch’s autograd uses the C++ engine as its backend, and that engine ships as compiled binaries. It is also tricky to figure out where to make the change (I tried looking into the codebase, but it seems that the backend does not perform the backward pass recursively but rather iteratively, using a multitude of data structures), so there is no clean way of not calling backward on the nodes that do not have the maximum gradient.
Some guidance on how to achieve the desired functionality would be helpful. Please note that speed is not an issue here; I just want to assess how a model trained (or finetuned) using such a technique would perform. Hence it would be fine if the backend used is a python engine. However, I can’t seem to figure out how to do this.
Depending upon the details of what you want to do, you might be able to
use autograd.grad(). Note, this is part of pytorch’s python interface, so
there would be no need to drill down into the C++ implementation.
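For instance, here is a minimal sketch (the two-layer model below is just a made-up placeholder) of using autograd.grad() to request gradients for exactly the tensors you care about, without calling .backward() at all:

```python
import torch

# a small hypothetical model, just to illustrate the call
lin1 = torch.nn.Linear(8, 8)
lin2 = torch.nn.Linear(8, 1)
x = torch.randn(4, 8)
loss = lin2(torch.relu(lin1(x))).sum()

# autograd.grad() returns the gradients you ask for without touching .grad
# and without backpropagating into anything you didn't request
grads = torch.autograd.grad(loss, (lin2.weight, lin2.bias))
print([g.shape for g in grads])
```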
However, it’s not clear to me exactly what you want. Let’s say that the last
layer of your model is a Linear and you pass its output to some loss function,
and you imagine calling loss.backward(). loss will typically have one child,
namely the output of your final Linear. Conceptually, this will have three
children – the Linear’s weight, its bias, and the input to that Linear, likely
the output of the previous layer.
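(You can inspect this structure directly through grad_fn; the snippet below is an illustrative sketch with a made-up two-layer model, and the exact node names can vary across pytorch versions.)

```python
import torch

lin1 = torch.nn.Linear(8, 8)
lin2 = torch.nn.Linear(8, 1)
x = torch.randn(4, 8)
loss = lin2(torch.relu(lin1(x))).sum()

print(loss.grad_fn)                 # e.g. SumBackward0 -- the root of the backward graph
print(loss.grad_fn.next_functions)  # one child: the grad_fn of the last Linear's output
# that child in turn typically has three children: an AccumulateGrad for the bias,
# the previous layer's grad_fn for the input, and (via a transpose node) an
# AccumulateGrad for the weight
print(loss.grad_fn.next_functions[0][0].next_functions)
```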
What specifically do you propose to do? The weight and bias have no
children, but are trainable, while the input has children – the parameters
of the previous layer – but is not, itself, trainable.
Let’s say that of these three children, weight turns out to have the largest
gradient. Do you now populate weight.grad, but otherwise backpropagate
no further? On the other hand, let’s say that the Linear’s input has the
largest gradient. Do you then continue your “max-gradient” backpropagation
through the previous layer, but not accumulate a gradient into weight.grad
nor fine-tune weight?
Also let’s say that if you were to perform the full backpropagation, you would
discover, for example, that the bias of the first layer has a gradient larger
than the weight of the last layer. If you stopped backpropagation at the last
layer as soon as you determined that the last layer’s weight had the largest
gradient of the three children in question, you would never discover that the
first layer’s bias had a larger gradient. Would such a scenario fit properly
into your use case?
During backpropagation, when finding the derivative of the loss w.r.t. the current parameter, only take the scalar gradients (where the scalars are the individual elements of the tensor parameter) whose magnitudes are in the top k of the scalar gradients. Fix the rest to 0.
Note that doing this during optimizer.step() would not be correct: a naive way of setting param.grad to 0 for all but the top-k-magnitude gradients in the current param would not produce the required effect, since the full gradients of the succeeding layers have already been used in the gradient computation, when only the top k gradients were supposed to be used.
As a result, I would need to modify the backward() pass to set the non-top-k gradients to 0 at each node. Also note that one motivation for this is to increase the speed of the backward pass, so it would make sense to pass sparse gradients (since most would be set to 0) to the children, which would then do the same thing, and so on.
I am not quite able to figure out how to achieve this. Since speed is crucial here, I may also have to modify the C++ engine. I need to do this to compare the speed increase over the “dense” gradient computation (i.e., the default autograd behaviour). Note that this is different from my previous question, where speed was not an issue.
You should be able to do this by registering a hook on each node to which
you want to pass zeroed-out gradients. The hook would perform the topk
computation and create a new gradient tensor with the non-topk entries set
to zero.
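Here is a rough sketch of what I have in mind; the tiny model, the value of k, and the topk_mask helper are all made up for illustration:

```python
import torch

def topk_mask(grad, k):
    # zero out everything except the k largest-magnitude entries of grad
    flat = grad.flatten()
    if k >= flat.numel():
        return grad
    mask = torch.zeros_like(flat)
    mask[flat.abs().topk(k).indices] = 1.0
    return (flat * mask).view_as(grad)

lin1 = torch.nn.Linear(8, 8)
lin2 = torch.nn.Linear(8, 1)
x = torch.randn(4, 8)

h = lin1(x)
# hook on an intermediate tensor: the gradient flowing back into lin1 gets masked
h.register_hook(lambda g: topk_mask(g, k=5))
# hook on a parameter: the gradient accumulated into lin2.weight.grad gets masked
lin2.weight.register_hook(lambda g: topk_mask(g, k=5))

loss = lin2(torch.relu(h)).sum()
loss.backward()
```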
I don’t know whether there is a clean way to do the sparse part, though. If you pass a gradient that is
packaged as a full tensor into a backward hook and have the hook return a
sparse-tensor gradient, autograd will complain. I don’t know whether you can
arrange things so that nothing but sparse gradients flow through your hooks.
You can pass a sparse tensor into autograd.grad() as its grad_outputs
argument (the vector in the vector-jacobian product that autograd.grad()
computes), so perhaps if you perform the backward pass “manually” by
chaining together a bunch of autograd.grad() calls, you could work with
sparse gradients.
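Something along these lines, for instance; the two-layer model and the top-k masking are placeholders, and I keep the masked gradient dense here (you could experiment with calling .to_sparse() on it):

```python
import torch

lin1 = torch.nn.Linear(8, 8)
lin2 = torch.nn.Linear(8, 1)
x = torch.randn(4, 8)

h = lin1(x)
loss = lin2(torch.relu(h)).sum()

# first hop: gradient of the loss with respect to the intermediate activation h
(grad_h,) = torch.autograd.grad(loss, h, retain_graph=True)

# mask it down to its top-k entries (kept as a dense tensor full of zeros here)
k = 5
flat = grad_h.flatten()
mask = torch.zeros_like(flat)
mask[flat.abs().topk(k).indices] = 1.0
grad_h_masked = (flat * mask).view_as(grad_h)

# second hop: vector-jacobian product through lin1, seeded with the masked gradient
grad_w1, grad_b1 = torch.autograd.grad(
    h, (lin1.weight, lin1.bias), grad_outputs=grad_h_masked
)
```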
Having said that, working with sparse gradients may or may not give you
any speed-up (and could even slow things down). The reason is that even
though your sparse-gradient computations will use (potentially significantly)
fewer floating-point operations, the gpu (and even cpu) floating-point pipelines
are so heavily optimized for structured floating-point arithmetic (e.g., matrix
multiplication) that the speed they lose on the less-structured sparse-tensor
operations could more than undo any speed benefit of doing “less” work. The
only good way I know to evaluate these trade-offs is to time concrete code
(and those relative timings may differ significantly across hardware platforms).
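For example, a quick, admittedly crude comparison along the lines of the following (the matrix size and sparsity level are arbitrary) is usually more informative than counting floating-point operations:

```python
import torch
import torch.utils.benchmark as benchmark

n = 2048
dense = torch.randn(n, n)
other = torch.randn(n, n)
# a ~99%-sparse version of the same matrix, stored in COO format
sparse = (dense * (torch.rand(n, n) < 0.01)).to_sparse()

t_dense = benchmark.Timer(stmt="a @ b", globals={"a": dense, "b": other})
t_sparse = benchmark.Timer(
    stmt="torch.sparse.mm(a, b)",
    globals={"torch": torch, "a": sparse, "b": other},
)

print(t_dense.timeit(20))
print(t_sparse.timeit(20))
```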