TBPTT through the product of two parameters

Hello,
This is my first question and I wanted to say thanks for all the good answers I have found here.

I am having trouble backpropagating through an activation I am testing.

    self.a = Parameter(torch.mul(torch.ones(1,N),.5))  # mixing coefficient, initialized to 0.5
    self.g = Parameter(torch.rand(1,N))                 # per-unit gain
    self.a_list = []
    self.g_list = []

    def forward(self, x, z):
        return (1-self.a)*x + self.a*self.g*torch.tanh(x+z)

I would like to learn the parameters a and g; however, the product of self.a and self.g is an issue during backprop, since self.a is changed in place before the new self.g is calculated.

Inputs x and z are outputs from separate linear layers.
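
For concreteness, here is a minimal, self-contained sketch of how the activation is wired up; the class name, the layer names, and the sizes (D, N, and the numbers in the usage lines) are just placeholders for this question:

    import torch
    from torch import nn
    from torch.nn import Parameter

    class ActDemo(nn.Module):
        def __init__(self, D, N):
            super().__init__()
            self.lin_x = nn.Linear(D, N)   # produces x
            self.lin_z = nn.Linear(D, N)   # produces z
            self.a = Parameter(torch.mul(torch.ones(1, N), .5))
            self.g = Parameter(torch.rand(1, N))

        def forward(self, inp):
            x = self.lin_x(inp)
            z = self.lin_z(inp)
            return (1 - self.a) * x + self.a * self.g * torch.tanh(x + z)

    net = ActDemo(8, 16)
    out = net(torch.randn(4, 8))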

I have attempted different cloning schemes for these parameters, among other things.

As of now (different from the lines above), I am treating them as states, similar to an LSTM's hidden state, and passing them through the full network repeatedly until calculating the loss on the last prediction.
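
For reference, the generic TBPTT pattern I am trying to mimic looks like the sketch below; the RNN cell, readout, and loss are toy stand-ins, not my actual model. The carried state is detached at every truncation boundary so the graph never spans more than one window, and the loss is taken on the last prediction of the window:

    import torch
    from torch import nn

    cell = nn.RNNCell(4, 8)               # toy recurrent cell (stand-in)
    head = nn.Linear(8, 1)                # toy readout (stand-in)
    opt = torch.optim.SGD(list(cell.parameters()) + list(head.parameters()), lr=1e-2)

    h = torch.zeros(1, 8)                 # carried state, like an LSTM's hidden
    for window in range(3):               # truncation windows
        h = h.detach()                    # cut the graph at the window boundary
        for t in range(5):                # steps inside one window
            h = cell(torch.randn(1, 4), h)
        loss = head(h).pow(2).mean()      # loss on the last prediction only
        opt.zero_grad()
        loss.backward()
        opt.step()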

Is there a better way to do this? Also, aside from using make_dot(), is there a tool to see how the graph is freed during backprop? From what I have seen, the lists and deletions seem to proceed from left to right, which is why I believe self.a is being changed first.

Thanks for your time!

Going through the autograd tutorials and other online discussions, I came up with a custom Function that specifically handles the multiplication of the two learnable parameters and returns the partial derivative with respect to each.

    class Pmul(torch.autograd.Function):
        @staticmethod
        def forward(ctx, a, g):
            # save both inputs so backward can form each partial derivative
            ctx.save_for_backward(a, g)
            return torch.mul(a, g)

        @staticmethod
        def backward(ctx, grad_output):
            a, g = ctx.saved_tensors
            da = grad_output * g   # d(a*g)/da = g
            dg = grad_output * a   # d(a*g)/dg = a
            return da, dg

With this I alias pmul = Pmul.apply and then replace

    … + self.a * self.g * torch.tanh(…)

with

    … + pmul(self.a, self.g) * torch.tanh(…)

I am testing it more but it seems to work. Any feedback is much appreciated.
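
In case it helps anyone checking a similar Function, one test I have been running is torch.autograd.gradcheck with double-precision inputs, which compares the analytical gradients from backward against numerical finite differences (the shapes below are arbitrary):

    import torch

    pmul = Pmul.apply

    a = torch.rand(1, 8, dtype=torch.double, requires_grad=True)
    g = torch.rand(1, 8, dtype=torch.double, requires_grad=True)

    # checks da = grad_output*g and dg = grad_output*a against finite differences
    print(torch.autograd.gradcheck(pmul, (a, g)))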

Thanks in advance!