Hello guys, I’m working on translating a lab I did in MATLAB to PyTorch.
The model is supposed to be a vanilla RNN that synthesizes text based on a book it trains on.
Most guides I’ve come across that use the vanilla RNN module don’t seem to pass the hidden state from one step to the next, and they also feed the input in “batches”, something I haven’t worked with before. My approach has been to modify these guides by setting the batch size to 1 and passing the previous hidden state on to the next step, both while synthesizing and while training, resetting it only when entering a new epoch.
The guide I’ve been following: guide
However, I’ve run into a problem and I’d be really grateful if I could get some advice:
# Actually building the network
import torch
import torch.nn as nn

class VanillaRNN(nn.Module):
    def __init__(self, input_size, output_size, hidden_size, n_layers):
        super().__init__()
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.rnn = nn.RNN(input_size, hidden_size, n_layers, batch_first=True)
        # One output score per character (output_size instead of the global len(chars))
        self.fc = nn.Linear(hidden_size, output_size)
        # Defined myself
        self.should_init_hidden = True

    # Added a hidden state parameter
    def forward(self, x, h=None):
        batch_size = x.size(0)
        #print("batch size: " + str(batch_size))
        # If first forward pass in training/while synthesizing
        if self.should_init_hidden:
            hidden = self.init_hidden(batch_size)
        # Else use the provided parameter
        else:
            hidden = h
        # Passing in the input and hidden state into the model and obtaining outputs
        out, hidden = self.rnn(x, hidden)
        # Reshaping the outputs such that it can be fit into the fully connected layer
        out = out.contiguous().view(-1, self.hidden_size)
        out = self.fc(out)
        return out, hidden

    def init_hidden(self, batch_size):
        # This method generates the first hidden state of zeros which we'll use in the forward pass
        # The tensor is created on the same device as the model's weights
        hidden = torch.zeros(self.n_layers, batch_size, self.hidden_size,
                             device=self.fc.weight.device)
        return hidden
This looks like most guides except I’ve added the parameter h in the forward function, as well as a way to determine whether or not we should initialize h. If someone could explain the concept of batches, and why most models I’ve seen aren’t passing along h, that would be super kind.
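On the batch question, a minimal sketch of the tensor shapes `nn.RNN` works with may help. `batch_size = 4` and `seq_len = 25` are made-up values for illustration; `K` and `m` follow the values used in this post. The sketch also shows that leaving out the hidden state is the same as starting from zeros, which is probably why many guides never pass `h` explicitly: they treat every batch as a fresh start.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

K, m = 80, 100                 # vocabulary size and hidden size, as in the post
batch_size, seq_len = 4, 25    # illustrative values, not from the post

rnn = nn.RNN(K, m, num_layers=1, batch_first=True)

# With batch_first=True the input is (batch, seq_len, input_size):
# a batch is several independent sequences stacked along dim 0.
x_batch = torch.randn(batch_size, seq_len, K)

# Omitting the hidden state makes nn.RNN start from zeros ...
out_default, h_default = rnn(x_batch)

# ... which matches passing an explicit zero tensor of shape
# (num_layers, batch, hidden_size).
h0 = torch.zeros(1, batch_size, m)
out_zeros, h_zeros = rnn(x_batch, h0)

print(out_default.shape)                        # torch.Size([4, 25, 100])
print(h_default.shape)                          # torch.Size([1, 4, 100])
print(torch.allclose(out_default, out_zeros))   # True
```

With a batch size of 1, as in this post, the same shapes apply with the first input dimension equal to 1.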
Here’s how I’m training:
# K = 80 (number of unique characters), m = 100
model = VanillaRNN(input_size=K, output_size=K, hidden_size=m, n_layers=1)
model.to(device)
n_epochs = 1
criterion = nn.CrossEntropyLoss()
# model.parameters are all params e.g U, V, b, c etc, see print below
#for param in list(model.parameters()):
#    print(param.shape)
opt = torch.optim.Adagrad(model.parameters(), lr=eta)
debug = 1
n_steps = 1
clip = 5
update_step = 0
smooth_loss = 0
for epoch in range(1, n_epochs + 1):
    model.should_init_hidden = True
    for i in range(x.shape[0]):
        if debug and i > n_steps:
            break
        opt.zero_grad() # clear gradients in between updates
        seq = x[i]
        seq = np.expand_dims(seq, axis=0)
        # .to() returns a new tensor, so the result has to be assigned
        seq = torch.from_numpy(seq).to(device)
        #print("seq shape: " + str(seq.shape))
        target = y[i]
        target = one_hot_to_ind(target)
        target = torch.from_numpy(target).to(device)
        #print("target shape: " + str(target.shape))
        #newY = torch.from_numpy(np.expand_dims(y[0], axis=0))
        #print("new y: " + str(newY.shape))
        if i == 0:
            output, hidden = model(seq)
            model.should_init_hidden = False
        else:
            output, hidden = model(seq, hidden)
        #output = np.expand_dims(output.detach().numpy(), axis=0)
        #output = torch.from_numpy(output)
        #print(y.view(-1).long().size())
        #print(output[0])
        loss = criterion(output, target)
        # Track the running loss as a plain float so it does not hold on to the graph
        if update_step == 0:
            smooth_loss = loss.item()
        else:
            smooth_loss = smooth_loss*0.999 + loss.item()*0.001
        if update_step % 500 == 0:
            print(smooth_loss)
        loss.backward() # Does backpropagation and calculates gradients
        nn.utils.clip_grad_norm_(model.parameters(), clip) # Clip the gradients
        opt.step() # Updates the weights accordingly
        update_step += 1
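The helper `one_hot_to_ind` is not shown above. A hypothetical version, assuming `y[i]` is a `(seq_len, K)` one-hot NumPy array, could look like the sketch below. The conversion is needed because `nn.CrossEntropyLoss` takes raw scores of shape `(N, K)` together with integer class indices of shape `(N,)`. The random logits and `seq_len = 25` are placeholders.

```python
import numpy as np
import torch
import torch.nn as nn

def one_hot_to_ind(one_hot):
    """Hypothetical helper: (seq_len, K) one-hot rows -> (seq_len,) class indices."""
    return np.argmax(one_hot, axis=1)

K, seq_len = 80, 25                        # seq_len is illustrative
labels = np.random.randint(0, K, size=seq_len)
y_one_hot = np.eye(K)[labels]              # (seq_len, K) one-hot targets

# Class indices as a LongTensor of shape (seq_len,)
target = torch.from_numpy(one_hot_to_ind(y_one_hot)).long()
logits = torch.randn(seq_len, K)           # stands in for the model output

loss = nn.CrossEntropyLoss()(logits, target)
print(target.shape, loss.item())
```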
However, now when I’m trying to train (it worked before, when I didn’t pass along my hidden state), I get this error message:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
I think the knowledge I’m lacking is in how the machinery of PyTorch works behind the scenes and the concept of batches. I wanted to stay as true to my MATLAB implementation as possible (and to how I handled input there), which is why I’m neglecting the batching.
EDIT: I noticed the part “Specify retain_graph=True when calling backward the first time.”, but I’m not sure using this option would lead to the behavior I’m expecting. I don’t understand what retaining the computational graph means.
EDIT2: Setting retain_graph=True yields NaN for my smooth_loss after 500 update steps, so that doesn’t seem to work.
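The error seems to arise because the `hidden` returned at one step is still attached to that step's computational graph, so `loss.backward()` at the next step tries to backpropagate into a graph whose buffers were already freed. `retain_graph=True` keeps those buffers, which means every backward pass would reach back through all earlier steps. A minimal sketch of the usual alternative, truncated backpropagation through time via `hidden.detach()`, is below. It uses random data, a learning rate of 0.1, and a bare `nn.RNN` plus `nn.Linear` as stand-ins for the model and data above, so it is an illustration under those assumptions rather than a drop-in fix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

K, m, seq_len = 80, 100, 25    # seq_len is illustrative
rnn = nn.RNN(K, m, batch_first=True)
fc = nn.Linear(m, K)
params = list(rnn.parameters()) + list(fc.parameters())
opt = torch.optim.Adagrad(params, lr=0.1)   # lr chosen arbitrarily
criterion = nn.CrossEntropyLoss()

hidden = None                  # nn.RNN starts from zeros when given None
losses = []
for step in range(3):
    # Random stand-ins for one chunk of the book (batch size 1).
    seq = torch.randn(1, seq_len, K)
    target = torch.randint(0, K, (seq_len,))

    opt.zero_grad()
    out, hidden = rnn(seq, hidden)
    loss = criterion(fc(out.reshape(-1, m)), target)
    loss.backward()
    nn.utils.clip_grad_norm_(params, 5)
    opt.step()

    # Keep the values of the hidden state but cut its link to this step's
    # graph, so the next backward() stops at the chunk boundary.
    hidden = hidden.detach()
    losses.append(loss.item())

print(losses)
```

The hidden state's values still carry over between chunks. Only the gradient flow is cut at each chunk boundary, which mirrors computing gradients over one sequence at a time.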
Any help on resolving this would be greatly appreciated :^)