Automatic Differentiation for RNN Leads to Issues

While building an RNN from scratch, I am getting a “one of the variables needed for gradient computation has been modified by an inplace operation” error. I suspect this is because the hidden outputs, which also serve as inputs, are being modified when the weights move down their gradients, but I am not entirely sure. The error points into the forward pass of the network, so I’ll share the network class:

import torch.nn as nn

class RNN(nn.Module):

    def __init__(self, input_size, hidden_size=8):
        super().__init__()
        self.W = nn.Linear(input_size, hidden_size + 1)
        self.U = nn.Linear(hidden_size, hidden_size + 1) # if this doesn't work, try o_t = g(w_t, w_o)
        self.activation = nn.Tanh()

    def forward(self, input, hidden):
        x = self.W(input)
        x += self.U(hidden)  # note: an in-place addition
        x = self.activation(x)
        return x

I’ve played around with it a little, and the trigger seems to be the call to self.U(hidden) when computing the network output. The variable the error message points to is also self.U.

Could someone help me with this particular issue, and possibly suggest resources for learning about the intricacies of autograd, so that I know exactly what is going on when issues like these arise?

While it probably won’t solve your issue: why do you have hidden_size + 1? This will cause problems after the first iteration, when you feed the previous hidden state back in as the current hidden state, since the dimensions won’t match (self.U expects hidden_size but receives hidden_size + 1).
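To illustrate, using the RNN class from your post (sizes here are arbitrary):

import torch

# Using the RNN class from the original post; sizes are arbitrary
rnn = RNN(input_size=4, hidden_size=8)   # self.U expects a hidden state of size 8
h = torch.zeros(1, 8)
h = rnn(torch.randn(1, 4), h)            # h now has size 9 (hidden_size + 1)
h = rnn(torch.randn(1, 4), h)            # raises a shape-mismatch RuntimeError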

My working vanilla implementation of an RNN cell looks as follows:

import torch
import torch.nn as nn

class VanillaRNNCell(nn.Module):

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.Wxh = nn.Linear(input_size, hidden_size)   # input-to-hidden
        self.Whh = nn.Linear(hidden_size, hidden_size)  # hidden-to-hidden

    def forward(self, inputs, hidden):
        # out-of-place addition; nothing is mutated in place
        return torch.tanh(self.Wxh(inputs) + self.Whh(hidden))

with a small working example in this notebook.
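For illustration, unrolling the cell over a sequence might look like this (shapes are just examples):

import torch

# Unrolling VanillaRNNCell (defined above) over a sequence; sizes are arbitrary
cell = VanillaRNNCell(input_size=4, hidden_size=8)
x = torch.randn(10, 3, 4)          # (seq_len, batch, input_size)
h = torch.zeros(3, 8)              # (batch, hidden_size) initial state
outputs = []
for t in range(x.size(0)):
    h = cell(x[t], h)              # hidden state feeds back with a matching shape
    outputs.append(h)
outputs = torch.stack(outputs)     # (seq_len, batch, hidden_size)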

Hello Chris! Thank you for your response! I was trying to output the hidden values using all but the last of the layer’s output values, and the actual output using the last one (once I get it working, I want to apply some more transformations to that last output). I’m new to RNNs, so this was a bit of an experiment.

However, I think you’re right that this won’t lead to fundamental differences. That being said, I’ll try this more standard implementation and see if it makes a difference for autograd.

I’ve had that error, but can’t remember why, sorry.

If it helps, here is my standard RNN. It is a drop-in replacement for the PyTorch one, so it supports multiple layers (not all that useful, and it complicates the code a lot).

# a vanilla RNN
# from https://pytorch.org/docs/stable/generated/torch.nn.RNN.html
import torch
import torch.nn as nn

bias = False

class RNNbase(nn.Module):
  def __init__(self, ninp, nhid, nlayer):
    super().__init__()
    self.nlayer = nlayer
    self.nhid = nhid
    self.fih = nn.ModuleList([nn.Linear(ninp if n == 0 else nhid, nhid, bias=bias)
                              for n in range(nlayer)])
    self.fhh = nn.ModuleList([nn.Linear(nhid, nhid, bias=bias) for n in range(nlayer)])

  def forward(self, x, h_0=None, batch_first=False):
    if batch_first:
      x = x.transpose(0, 1)
    seq_len, nbatch, _ = x.size()
    if h_0 is None:
      h_0 = x.new_zeros(self.nlayer, nbatch, self.nhid)
    # keep per-layer hidden states in a list rather than assigning into a
    # tensor, which would be exactly the kind of in-place write autograd rejects
    h_t_minus_1 = list(h_0)
    output = []
    for t in range(seq_len):
      h_t = []
      for layer in range(self.nlayer):
        ih_input = x[t] if layer == 0 else h_t[layer - 1]
        h_t.append(torch.tanh(self.fih[layer](ih_input)
                              + self.fhh[layer](h_t_minus_1[layer])))
      output.append(h_t[-1])
      h_t_minus_1 = h_t
    output = torch.stack(output)
    if batch_first:
      output = output.transpose(0, 1)
    return output, torch.stack(h_t)
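A quick shape check, to show it lines up with what nn.RNN returns (sizes are arbitrary):

import torch

# Arbitrary sizes, just to check the shapes match nn.RNN's
rnn = RNNbase(ninp=4, nhid=8, nlayer=2)
x = torch.randn(10, 3, 4)        # (seq_len, batch, ninp)
output, h_n = rnn(x)
print(output.shape)              # torch.Size([10, 3, 8])
print(h_n.shape)                 # torch.Size([2, 3, 8])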

Hello Tony! Thank you for sharing!

I’ve determined thus far that it might have to do with how I train the RNN: before, I was training as I computed the outputs at each time step (the MSE was the square of a single difference at a single time step) while passing retain_graph=True to the .backward() call. Now I train once at the end of the computation, and it works fine. Is this what you do?
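For concreteness, the failing pattern versus the working one looked roughly like this (rnn, optimizer, criterion, inputs, targets, and seq_len are placeholders):

import torch

# Per-time-step training (this triggered the in-place error for me):
hidden = torch.zeros(1, 8)
for t in range(seq_len):
    hidden = rnn(inputs[t], hidden)
    loss = criterion(hidden, targets[t])
    loss.backward(retain_graph=True)  # keeps the graph from earlier steps alive
    optimizer.step()                  # modifies the weights in place, which the
    optimizer.zero_grad()             # retained graph still needs for backward

# Training once at the end (this works):
hidden = torch.zeros(1, 8)
losses = []
for t in range(seq_len):
    hidden = rnn(inputs[t], hidden)
    losses.append(criterion(hidden, targets[t]))
torch.stack(losses).mean().backward() # one backward through the whole unroll
optimizer.step()
optimizer.zero_grad()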

I would prefer to be able to train at each time step as I go through the data, since I believe this would speed up convergence.

I’m glad you got it working. Personally, I wouldn’t worry about doing backprop at every time step. Let me give you two reasons to convince you. Firstly, every output can depend on any previous input in the computation graph, so if you backprop at every time step all the way back to the start you have O(n^2) computation, but do it once per batch and it’s O(n), as the gradients add. Secondly, I find that it’s most efficient to batch up about 512 sequences, where every sequence is about length 32. That is, AdamW likes fairly big batch sizes; if they are too small, there’s too much noise and you bounce around the minimum instead of converging on it (see the sketch below).

Hope that helps. I’m convinced that we don’t understand transformers, and that deep RNN variants have a lot more to give if only we put the effort in.
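As a sketch, “once per batch” might look like this with the RNNbase above (sizes and names are illustrative):

import torch

# Illustrative sizes: a batch of 512 sequences, each of length 32
rnn = RNNbase(ninp=4, nhid=8, nlayer=2)
optimizer = torch.optim.AdamW(rnn.parameters())
criterion = torch.nn.MSELoss()

x = torch.randn(32, 512, 4)          # (seq_len, batch, ninp)
targets = torch.randn(32, 512, 8)    # placeholder targets

output, h_n = rnn(x)                 # unroll the whole chunk in one graph
loss = criterion(output, targets)
loss.backward()                      # a single backward pass: O(n), not O(n^2)
optimizer.step()
optimizer.zero_grad()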