Hi, I was experimenting with LSTMs and noticed that training an unrolled LSTM (stepping through the sequence manually) gives much worse results than the rolled one (passing the whole sequence in a single call). The test errors I get are a lot higher.
Below are the two variants of my code that are relevant; the rest of my code is unchanged. This first one is the rolled version: I pass the data with the full sequence of timesteps in one call before sending the last output to a fully connected layer.
import torch
import torch.nn as nn
from torch.autograd import Variable

class Net(nn.Module):
    def __init__(self, feature_dim, hidden_dim, batch_size):
        super(Net, self).__init__()
        num_layers = 1
        # single-layer LSTM (note: dropout only applies between stacked
        # layers, so with num_layers=1 it has no effect)
        self.lstm = nn.LSTM(feature_dim, hidden_size=hidden_dim,
                            num_layers=num_layers, batch_first=True, dropout=0.7)
        # fixed random initial states, created once and reused on every forward pass
        self.h0 = Variable(torch.randn(num_layers, batch_size, hidden_dim))
        self.c0 = Variable(torch.randn(num_layers, batch_size, hidden_dim))
        # fc layer
        self.fc1 = nn.Linear(hidden_dim, 2)

    def forward(self, x, mode=False):
        # x is [batch_size, seq_len, feature_dim] because batch_first=True
        output, (hn, cn) = self.lstm(x, (self.h0, self.c0))
        # classify using the output at the last timestep
        output = self.fc1(output[:, -1, :])
        return output
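For concreteness, this is roughly how I call it; the dimensions here are made up just to show the expected shapes, my real feature_dim and hidden_dim differ:

# minimal sanity check -- dimensions are made up for illustration
feature_dim, hidden_dim, batch_size, seq_len = 8, 32, 4, 10
net = Net(feature_dim, hidden_dim, batch_size)
x = Variable(torch.randn(batch_size, seq_len, feature_dim))  # [batch, seq, feature]
out = net(x)
print(out.size())  # (4, 2): one 2-class score vector per sequence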
And the results (test errors are the right-most column, out of 100 test samples):
epoch 0 tr loss 54.90 te loss 17.37 tr err 144/316 te err 51/100
epoch 20 tr loss 48.21 te loss 15.11 tr err 96/316 te err 31/100
epoch 40 tr loss 37.15 te loss 13.07 tr err 71/316 te err 27/100
epoch 60 tr loss 31.83 te loss 15.43 tr err 62/316 te err 28/100
epoch 80 tr loss 27.14 te loss 25.34 tr err 45/316 te err 29/100
epoch 100 tr loss 24.40 te loss 32.11 tr err 39/316 te err 28/100
epoch 120 tr loss 23.74 te loss 22.59 tr err 32/316 te err 24/100
epoch 140 tr loss 28.67 te loss 23.78 tr err 50/316 te err 26/100
epoch 160 tr loss 15.99 te loss 29.97 tr err 24/316 te err 30/100
epoch 180 tr loss 18.61 te loss 29.87 tr err 22/316 te err 26/100
epoch 200 tr loss 25.49 te loss 36.15 tr err 31/316 te err 28/100
epoch 220 tr loss 20.56 te loss 33.28 tr err 33/316 te err 24/100
epoch 240 tr loss 6.13 te loss 49.73 tr err 7/316 te err 25/100
epoch 260 tr loss 18.26 te loss 38.68 tr err 12/316 te err 27/100
epoch 280 tr loss 4.94 te loss 54.48 tr err 4/316 te err 23/100
epoch 300 tr loss 4.12 te loss 57.66 tr err 9/316 te err 25/100
epoch 320 tr loss 20.31 te loss 47.79 tr err 28/316 te err 28/100
epoch 340 tr loss 3.74 te loss 76.23 tr err 10/316 te err 28/100
epoch 360 tr loss 20.10 te loss 45.14 tr err 25/316 te err 23/100
epoch 380 tr loss 2.62 te loss 54.53 tr err 16/316 te err 28/100
epoch 400 tr loss 2.22 te loss 51.11 tr err 13/316 te err 24/100
epoch 420 tr loss 2.21 te loss 55.38 tr err 12/316 te err 29/100
epoch 440 tr loss 5.46 te loss 51.78 tr err 11/316 te err 22/100
epoch 460 tr loss 1.88 te loss 46.23 tr err 13/316 te err 25/100
epoch 480 tr loss 8.04 te loss 43.05 tr err 19/316 te err 25/100
Now the unrolled version: I loop through the sequence and feed the LSTM one timestep at a time, carrying the hidden and cell states forward, before sending the output of the final step to a fully connected layer.
class Net(nn.Module):
    def __init__(self, feature_dim, hidden_dim, batch_size):
        super(Net, self).__init__()
        # lstm architecture
        self.hidden_size = hidden_dim
        self.input_size = feature_dim
        self.batch_size = batch_size
        self.num_layers = 1
        # lstm (no dropout argument here, but with a single layer that
        # matches the version above, where dropout has no effect anyway)
        self.lstm = nn.LSTM(feature_dim, hidden_size=self.hidden_size,
                            num_layers=self.num_layers, batch_first=True)
        # fc layer
        self.fc1 = nn.Linear(hidden_dim, 2)

    def forward(self, x, mode=False):
        # initialize hidden and cell states -- note these are re-drawn at
        # random on every forward call, unlike the fixed h0/c0 above
        hn = Variable(torch.randn(self.num_layers, self.batch_size, self.hidden_size))
        cn = Variable(torch.randn(self.num_layers, self.batch_size, self.hidden_size))
        # step through the sequence one timestep at a time;
        # x.transpose(0, 1) is [seq_len, batch, feature], so each xt is [batch, feature]
        # (torch.t only works on 2-D tensors, hence the explicit transpose)
        for xt in x.transpose(0, 1):
            output, (hn, cn) = self.lstm(xt[:, None, :], (hn, cn))
        # output is [batch_size, timestep = 1, hidden_dim]
        output = self.fc1(output[:, 0, :])
        return output
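To check whether the stepping itself is the problem, I understand that feeding an nn.LSTM one timestep at a time while carrying (h, c) should reproduce the single full-sequence call exactly, given identical initial states. A minimal sketch of that check (sizes made up, written against the newer tensor API without Variable):

# sanity check: stepping an LSTM one timestep at a time should match
# the single full-sequence call, given identical initial states
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=32, num_layers=1, batch_first=True)
x = torch.randn(4, 10, 8)                 # [batch, seq, feature]
h0 = torch.randn(1, 4, 32)
c0 = torch.randn(1, 4, 32)

# rolled: whole sequence in one call
full_out, _ = lstm(x, (h0, c0))

# unrolled: one timestep at a time, carrying (h, c)
hn, cn = h0, c0
for t in range(x.size(1)):
    step_out, (hn, cn) = lstm(x[:, t:t+1, :], (hn, cn))

# the final step's output should equal the last timestep of the full output
print(torch.allclose(full_out[:, -1, :], step_out[:, 0, :], atol=1e-6))  # True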
And the test errors for this version (same format as above):
epoch 0 tr loss 54.89 te loss 17.44 tr err 154/316 te err 53/100
epoch 20 tr loss 48.50 te loss 17.40 tr err 84/316 te err 43/100
epoch 40 tr loss 36.92 te loss 15.90 tr err 72/316 te err 34/100
epoch 60 tr loss 32.13 te loss 18.82 tr err 52/316 te err 32/100
epoch 80 tr loss 29.61 te loss 27.07 tr err 41/316 te err 27/100
epoch 100 tr loss 30.03 te loss 28.65 tr err 41/316 te err 31/100
epoch 120 tr loss 22.94 te loss 39.26 tr err 32/316 te err 31/100
epoch 140 tr loss 22.82 te loss 43.07 tr err 28/316 te err 33/100
epoch 160 tr loss 19.11 te loss 47.77 tr err 34/316 te err 32/100
epoch 180 tr loss 19.52 te loss 46.45 tr err 29/316 te err 33/100
epoch 200 tr loss 22.89 te loss 45.91 tr err 21/316 te err 29/100
epoch 220 tr loss 24.83 te loss 50.92 tr err 28/316 te err 35/100
epoch 240 tr loss 12.37 te loss 54.97 tr err 36/316 te err 34/100
epoch 260 tr loss 11.72 te loss 54.28 tr err 30/316 te err 33/100
epoch 280 tr loss 9.71 te loss 55.99 tr err 20/316 te err 35/100
epoch 300 tr loss 21.23 te loss 71.60 tr err 27/316 te err 34/100
epoch 320 tr loss 8.87 te loss 53.11 tr err 32/316 te err 31/100
epoch 340 tr loss 7.34 te loss 59.80 tr err 32/316 te err 37/100
epoch 360 tr loss 4.35 te loss 73.08 tr err 7/316 te err 35/100
epoch 380 tr loss 5.93 te loss 68.64 tr err 27/316 te err 33/100
epoch 400 tr loss 3.67 te loss 78.00 tr err 18/316 te err 35/100
epoch 420 tr loss 15.13 te loss 64.23 tr err 39/316 te err 38/100
epoch 440 tr loss 2.61 te loss 88.74 tr err 8/316 te err 38/100
epoch 460 tr loss 4.82 te loss 82.88 tr err 5/316 te err 38/100
epoch 480 tr loss 2.72 te loss 93.69 tr err 8/316 te err 42/100
I have run this experiment several times, and the unrolled version always performs worse. Is there something wrong with the way I am manually stepping through the LSTM?