Hello guys, I’m working on translating a lab I did in MATLAB into PyTorch.
The model is supposed to be a vanilla RNN that synthesizes text based on a book it trains on.
Most guides I’ve come across that use the vanilla RNN module don’t seem to pass the hidden state from previous steps over to the next, and they also use “batches” in the input, something I haven’t worked with before. My approach has been to modify these guides by setting the batch size to 1 and passing the previous hidden state along to the next step, both while synthesizing and while training, only resetting it when entering a new epoch.
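Concretely, the pattern I’m trying to reproduce looks roughly like this; this is just a toy sketch with made-up sizes and random data, not my actual code:

```python
import torch
import torch.nn as nn

# Toy sketch of the pattern I'm aiming for (made-up sizes, random inputs)
rnn = nn.RNN(input_size=8, hidden_size=16, num_layers=1, batch_first=True)

for epoch in range(2):
    hidden = torch.zeros(1, 1, 16)      # reset the hidden state at each new epoch
    for step in range(5):               # one sequence at a time, i.e. batch size 1
        seq = torch.randn(1, 10, 8)     # (batch=1, seq_len=10, input_size=8)
        out, hidden = rnn(seq, hidden)  # carry the hidden state to the next step
```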
The guide I’ve been following: guide
However, I’ve run into a problem and I’d be really grateful if I could get some advice:
```python
import torch
import torch.nn as nn
import numpy as np

# Actually building the network
class VanillaRNN(nn.Module):
    def __init__(self, input_size, output_size, hidden_size, n_layers):
        super().__init__()
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.rnn = nn.RNN(input_size, hidden_size, n_layers, batch_first=True)
        # chars is the list of unique characters in the book, defined earlier
        self.fc = nn.Linear(hidden_size, len(chars))  # Defined myself
        self.should_init_hidden = True  # Added a hidden state parameter

    def forward(self, x, h=None):
        batch_size = x.size(0)

        # If first forward pass in training/while synthesizing
        if self.should_init_hidden:
            hidden = self.init_hidden(batch_size)
        # Else use the provided parameter
        else:
            hidden = h

        # Passing the input and hidden state into the model and obtaining outputs
        out, hidden = self.rnn(x, hidden)

        # Reshaping the outputs so they can be fed into the fully connected layer
        out = out.contiguous().view(-1, self.hidden_size)
        out = self.fc(out)

        return out, hidden

    def init_hidden(self, batch_size):
        # Generates the initial hidden state of zeros used in the first forward pass
        hidden = torch.zeros(self.n_layers, batch_size, self.hidden_size)
        return hidden
```
This looks like most guides, except I’ve added the parameter h to the forward function, as well as a flag to determine whether or not we should initialize h. If someone could explain the concept of batches, and why most models I’ve seen don’t pass h along, that would be super kind.
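In case it helps to clarify what I mean, here is my current (possibly wrong) understanding of the shapes nn.RNN works with, as a standalone check with made-up sizes:

```python
import torch
import torch.nn as nn

# With batch_first=True, nn.RNN expects input of shape (batch, seq_len, input_size),
# i.e. a "batch" is just several independent sequences stacked along the first dim.
rnn = nn.RNN(input_size=80, hidden_size=100, num_layers=1, batch_first=True)

x = torch.zeros(4, 25, 80)    # 4 independent sequences, each 25 steps of 80 features
h0 = torch.zeros(1, 4, 100)   # hidden state is (num_layers, batch, hidden_size)

out, h = rnn(x, h0)
print(out.shape)  # torch.Size([4, 25, 100])
print(h.shape)    # torch.Size([1, 4, 100])
```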
Here’s how I’m training:
```python
# K = 80 (number of unique characters), m = 100
model = VanillaRNN(input_size=K, output_size=K, hidden_size=m, n_layers=1)
model.to(device)

n_epochs = 1
criterion = nn.CrossEntropyLoss()

# model.parameters() are all params, e.g. U, V, b, c etc.:
# for param in model.parameters():
#     print(param.shape)
opt = torch.optim.Adagrad(model.parameters(), lr=eta)

debug = 1
n_steps = 1
clip = 5
update_step = 0
smooth_loss = 0

for epoch in range(1, n_epochs + 1):
    model.should_init_hidden = True
    for i in range(x.shape[0]):
        if debug and i > n_steps:
            break

        opt.zero_grad()  # clear gradients in between updates

        seq = np.expand_dims(x[i], axis=0)      # add a batch dimension of 1
        seq = torch.from_numpy(seq).to(device)

        target = one_hot_to_ind(y[i])           # one-hot rows -> class indices
        target = torch.from_numpy(target).to(device)

        if i == 0:
            output, hidden = model(seq)
            model.should_init_hidden = False
        else:
            output, hidden = model(seq, hidden)

        loss = criterion(output, target)
        if update_step == 0:
            smooth_loss = loss
        else:
            smooth_loss = smooth_loss * 0.999 + loss * 0.001
        if update_step % 500 == 0:
            print(smooth_loss)

        loss.backward()  # does backpropagation and calculates gradients
        nn.utils.clip_grad_norm_(model.parameters(), clip)  # clip the gradients
        opt.step()  # updates the weights accordingly
        update_step += 1
```
However, now when I try to train (it worked before, when I didn’t pass the hidden state along), I get this error message:
```
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
```
I think the knowledge I’m lacking is in how the machinery of PyTorch works behind the scenes and in the concept of batches. I wanted to stay as true to my MATLAB representation as possible (and to how I handled input there), which is why I’m neglecting the batching.
EDIT: I noticed the part “Specify retain_graph=True when calling backward the first time.”, but I’m not sure that option would lead to the behavior I’m expecting, since I don’t understand what retaining the computational graph actually means.
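To try to get a feel for the error, I put together a tiny standalone snippet (a toy example, nothing to do with my model) that seems to trigger the same message by reusing a tensor whose graph was already freed:

```python
import torch

w = torch.ones(3, requires_grad=True)
h = torch.tanh(w)       # tanh saves its output for the backward pass
loss1 = h.sum()
loss1.backward()        # frees the saved buffers of the graph that produced h

loss2 = (h * h).sum()   # reuses h, whose graph has already been freed
loss2.backward()        # RuntimeError: Trying to backward through the graph a second time
```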
EDIT2: Setting retain_graph=True yields NaN for my smooth_loss after 500 update steps, so that doesn’t seem to work either.
Any help on resolving this would be greatly appreciated :^)