How to mask gradient computations in backward?


I am trying to learn a matrix E that maps a vector W to a vector C, and a matrix D that maps the vector C to a vector W2. However, I only want to update the parameters of E and D for certain (w, c) and (c, w) combinations, respectively.

In other words, I would like the backward() call on my loss to not compute any gradient for the parameters outside those allowed combinations (which I have defined in an adjacency matrix).

It seems like some type of sparse Variable would have this behavior, but I do not think that is implemented in PyTorch. Instead, does anyone know a trick for 'masking' these parameter matrices? I could simply zero out the gradients for the parameter entries that I want to ignore, or even more simply create a mask for each parameter matrix and compute, e.g., E * E_mask.

But I am not sure the gradients that remain are correct, because backprop will first compute the gradient w.r.t. every parameter before I get a chance to zero anything out.

A backward hook (register_hook) is probably what you are looking for.
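For reference, here is a minimal sketch of masking gradients with register_hook (the sizes and the E_mask name are illustrative, not from the original post):

```python
import torch

# Hypothetical sizes: W has 4 dimensions, C has 3
E = torch.randn(3, 4, requires_grad=True)
# Adjacency mask: 1 = trainable entry, 0 = frozen entry
E_mask = (torch.rand(3, 4) > 0.5).float()

# The hook runs during backward() and rewrites the gradient
# before it is accumulated into E.grad
E.register_hook(lambda grad: grad * E_mask)

W = torch.randn(4)
loss = (E @ W).sum()
loss.backward()

# Gradient entries at masked-out positions are exactly zero
assert torch.all(E.grad[E_mask == 0] == 0)
```

An optimizer step on E will then leave the masked-out entries unchanged, since their gradients are zero.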


Thanks! I think that did the trick, i.e. registering a hook function for each matrix:

def _E_hook(grad):
    # keep only the gradient entries allowed by the adjacency mask
    return grad * self.E_mask

def _D_hook(grad):
    return grad * self.D_mask

self.E.register_hook(_E_hook)
self.D.register_hook(_D_hook)

Just need to convince myself that computing the gradients of, e.g., the output w.r.t. D and then zeroing out the D gradients before backpropagating is equivalent to never computing gradients for those dimensions of D in the first place.
