Passing hidden state forwards in vanilla RNN

Hello guys, I’m working on translating a lab I did in Matlab into PyTorch.
The model is supposed to be a vanilla RNN that synthesizes text based on a book it trains on.

Most guides I’ve come across that use the vanilla RNN module don’t seem to pass the hidden state from previous steps over to the next, and they also use “batches” in the input, something I haven’t worked with before. My approach has been to modify these guides by setting the batch size to 1 and passing the previous hidden state over to the next step, both while synthesizing and while training, only resetting it when entering a new epoch.
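To make it concrete, here’s a tiny sketch (toy sizes, not my actual K and m) of what I mean by carrying the hidden state forwards with batch size 1:

```python
import torch
import torch.nn as nn

# Toy sizes, just to illustrate carrying the hidden state between chunks
rnn = nn.RNN(input_size=4, hidden_size=8, num_layers=1, batch_first=True)

x1 = torch.randn(1, 25, 4)   # (batch=1, seq_len=25, features)
x2 = torch.randn(1, 25, 4)   # the next chunk of the text

out1, h = rnn(x1)            # h defaults to zeros when omitted
out2, h = rnn(x2, h)         # feed the previous hidden state into the next call

print(out2.shape)            # torch.Size([1, 25, 8])
print(h.shape)               # torch.Size([1, 1, 8])
```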

The guide I’ve been following: guide

However, I’ve run into a problem and I’d be really grateful if I could get some advice:

# Actually building the network
class VanillaRNN(nn.Module):
    def __init__(self, input_size, output_size, hidden_size, n_layers):
        super(VanillaRNN, self).__init__()
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.rnn = nn.RNN(input_size, hidden_size, n_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
        # Defined myself
        self.should_init_hidden = True

    # Added a hidden state parameter
    def forward(self, x, h=None):
        batch_size = x.size(0)
        #print("batch size: " + str(batch_size))

        # If first forward pass in training/while synthesizing
        if self.should_init_hidden:
            hidden = self.init_hidden(batch_size)
        else:
            # Else use the provided parameter
            hidden = h

        # Passing the input and hidden state into the model and obtaining outputs
        out, hidden = self.rnn(x, hidden)

        # Reshaping the outputs so they can be fed into the fully connected layer
        out = out.contiguous().view(-1, self.hidden_size)
        out = self.fc(out)

        return out, hidden
    def init_hidden(self, batch_size):
        # This method generates the first hidden state of zeros which we'll use in the forward pass
        # We'll send the tensor holding the hidden state to the device we specified earlier as well
        hidden = torch.zeros(self.n_layers, batch_size, self.hidden_size)
        return hidden

This looks like most guides except I’ve added the parameter h in the forward function, as well as a way to determine whether or not we should initialize h. If someone could explain the concept of batches, and why most models I’ve seen aren’t passing along h, that would be super kind.
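For what it’s worth, this is my current picture of what a batch is (toy sizes, so correct me if I’m wrong): the batch dimension just stacks several independent sequences so the RNN can process them in parallel, with one hidden state per sequence.

```python
import torch
import torch.nn as nn

# My understanding of batching (toy sizes): a batch stacks several
# independent sequences so the RNN can process them in parallel.
rnn = nn.RNN(input_size=4, hidden_size=8, num_layers=1, batch_first=True)

batch = torch.randn(3, 25, 4)   # 3 independent sequences, each of length 25
out, h = rnn(batch)

print(out.shape)   # torch.Size([3, 25, 8]) -- outputs for every step of every sequence
print(h.shape)     # torch.Size([1, 3, 8])  -- one final hidden state per sequence
```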

Here’s how I’m training:

# K = 80 (number of unique characters), m = 100
model = VanillaRNN(input_size = K, output_size = K, hidden_size = m, n_layers=1)
n_epochs = 1
criterion = nn.CrossEntropyLoss()
# model.parameters are all params e.g U, V, b, c etc, see print below
#for param in list(model.parameters()):
#    print(param.shape)
opt = torch.optim.Adagrad(model.parameters(), lr=eta)

debug = 1
n_steps = 1
clip = 5
update_step = 0
smooth_loss = 0
for epoch in range(1, n_epochs + 1):
    model.should_init_hidden = True
    for i in range(x.shape[0]):
        if debug and i > n_steps:
            break
        opt.zero_grad() # clear gradients in between updates
        seq = x[i]
        seq = np.expand_dims(seq, axis=0)
        seq = torch.from_numpy(seq)
        #print("seq shape: " + str(seq.shape))        
        target = y[i]
        target = one_hot_to_ind(target)
        target = torch.from_numpy(target)
        #print("target shape: " + str(target.shape))
        #newY = torch.from_numpy(np.expand_dims(y[0], axis=0))
        #print("new y: " + str(newY.shape))
        if i == 0:
            output, hidden = model(seq)
            model.should_init_hidden = False
        else:
            output, hidden = model(seq, hidden)
        #output = np.expand_dims(output.detach().numpy(), axis=0)
        #output = torch.from_numpy(output)
        loss = criterion(output, target)
        if update_step == 0:
            smooth_loss = loss.item()  # use .item() so smooth_loss doesn't hold onto the graph
        else:
            smooth_loss = smooth_loss*0.999 + loss.item()*0.001
        if update_step % 500 == 0:
            print("smooth loss: " + str(smooth_loss))
        loss.backward() # Does backpropagation and calculates gradients
        nn.utils.clip_grad_norm_(model.parameters(), clip) # Clip the gradients
        opt.step() # Updates the weights accordingly
        update_step += 1

However, now when I try to train (it worked before, when I didn’t pass along my hidden state), I get this error message:

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

I think the knowledge I’m lacking is in how the machinery of PyTorch works behind the scenes and in the concept of batches. I wanted to stay as true to my Matlab representation as possible (and to how I handled input there), so that’s why I’m neglecting the batching.

EDIT: I noticed the part “Specify retain_graph=True when calling backward the first time.”, but I’m not sure using this option would lead to the behavior I’m expecting. I don’t understand what retaining the computational graph means.
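From other threads it sounds like the usual pattern is to detach the hidden state between updates instead of retaining the graph. Here’s a toy sketch of what I think that would look like (made-up sizes and a dummy loss, not my actual model), in case someone can confirm this is the right idea:

```python
import torch
import torch.nn as nn

# Toy setup (made-up sizes) to show detaching the hidden state between updates
rnn = nn.RNN(input_size=4, hidden_size=8, num_layers=1, batch_first=True)
opt = torch.optim.Adagrad(rnn.parameters(), lr=0.1)

h = None
for step in range(3):
    x = torch.randn(1, 5, 4)
    opt.zero_grad()
    out, h = rnn(x, h)
    loss = out.pow(2).mean()   # dummy loss just for the sketch
    loss.backward()            # no retain_graph error: the graph stops at the detach
    opt.step()
    h = h.detach()             # keep the values, drop the backward history
```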

EDIT2: Setting retain_graph = True yields NaN for my smooth_loss after 500 update steps, so that doesn’t seem to work either.

Any help on resolving this would be greatly appreciated :^)