Hey all

I am currently experimenting with different LSTM models for time-series prediction and am noticing a strange behaviour difference between two implementations that should be (from my understanding) fundamentally the same.

For my first implementation I used two `LSTMCells`

and avoided pytorch’s built in batching. I initially did this because my training data has variable sequence lengths and I was averse to dealing with the variable input length when it comes to batching, which is why I kind of cheated with the batch dimension in the `forward`

function. Thus to handle batches, (as shown below) I run each sample in the batch through the network individually, calculate the loss and the associated gradients, caching the gradients and updating the network using the accumulated gradients.

For my second model I tried to utilize pytorch optimizations and instead of `LSTMCell`

I used actual `LSTM`

layers and the built-in batching, which should result in a considerable speed up. Note that for this model, I always use `num_layers=2`

.

I then tried to compare how these two models train and found that for implementation 1, the model exhibits two phases, one where it quickly decreases, and another where it has noisy oscillations around a slowly decreasing average (much like an L shape). For model 2 I expected the same behaviour but found that while it also has the same initial phase where it quickly decreases and leads into the same second phase seen in implementation 1, once in this phase the MSELoss occasionally spikes up really high, and then the model enters another phase where it quickly decreases until it levels out again.

Is there any obvious reason why I may be seeing this difference? For both of these models I am using the `Adam`

optimizer with its default parameters, and `torch.nn.MSELoss()`

for the loss function. Ultimately I’d like to use Implementation 2 because it does lead to a considerable speed up in the training process, but I’m concerned that I am doing something wrong.

**Update:** I’ve realized that since I am not dividing the loss function by the batch size for implementation 1, the gradients may differ significantly in magnitude but this would mean that I’d have larger gradients for the first implementation, would it not? Meaning that if I was to see spiking behaviour, it should be for Imp 1 instead of Imp 2.

## Implementation 1: using naive batching

**Model**

```
class PyLSTMCells(nn.Module):
# init layers
def __init__(self, input_size, hidden_size, output_size):
super(PyLSTMCells, self).__init__()
# attach hidden layer size to class
self.hidden_size = hidden_size
# define layers
self.l1 = nn.LSTMCell(input_size, hidden_size)
self.l2 = nn.LSTMCell(hidden_size, hidden_size)
self.o = nn.Linear(hidden_size, output_size)
# define forward pass over single time-series
def forward(self, input):
# get tensor type of input
Ttype = input.type()
# add an additional "first" dimension to work
# with the LSTMCell interface
pad_input = input.unsqueeze(0)
# define hidden/cell states
h1 = torch.zeros(1, self.hidden_size).type(Ttype)
c1 = torch.zeros(1, self.hidden_size).type(Ttype)
h2 = torch.zeros(1, self.hidden_size).type(Ttype)
c2 = torch.zeros(1, self.hidden_size).type(Ttype)
# loop over time series and get final output
TS_len = pad_input.shape[1]
for i in range(TS_len):
h1, c1 = self.l1(pad_input[:,:,i], (h1,c1))
h2, c2 = self.l2(h1, (h2,c2))
output = self.o(h2)
# return output, ignoring the false "first" dim
# that was added
return output[0,:]
```

**Training Method**

```
opt.zero_grad() # Adam optimizer with default parameters
batch_loss = 0
for i in range(batch_size):
X, lbl = get_example()
output = model.forward(X) # where model is the network defined above
l = loss(output, lbl) # MSELoss
l.backward()
batch_loss += l
avg_batch_loss = batch_loss/batch_size
opt.step()
```

## Implementation 2: LSTM layers and Pytorch batching

**Model**

```
class BatchPyLSTM(nn.Module):
# init network
def __init__(self, input_size, hidden_size, output_size, num_layers=2):
super(BatchPyLSTM, self).__init__()
# attach important vars to class
self.input_size = input_size
self.hidden_size = hidden_size
self.num_layers = num_layers
# define layers
self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
self.lin = nn.Linear(hidden_size, output_size)
# init hidden and cell states
def init_hidden(self, Ttype, batch_size):
h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).type(Ttype) # hidden states
c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).type(Ttype) # cell states
return h0,c0
# forward pass over entire batch
def forward(self, input, seq_lens):
# get tensor type of input and batch size
Ttype = input.type()
batch_size = input.shape[0]
# init hidden and cell states
h0, c0 = self.init_hidden(Ttype, batch_size)
# reshape before packing to match pytorch interface
pckd_input = input.permute(0,2,1)
# pack input so the LSTM knows now to consider the padded values
pckd_input = nn.utils.rnn.pack_padded_sequence(pckd_input, seq_lens,
batch_first=True)
# feed through LSTM
lstm_otp,_ = self.lstm(pckd_input,(h0,c0))
# repad lstm_otp into normal tensor object so we can actually work with it
lstm_otp,_ = nn.utils.rnn.pad_packed_sequence(lstm_otp, batch_first=True,
padding_value=float('nan'))
# extract lstm output for final time-step of each series
fin_lstm_otp = torch.stack([ o[l-1,:] for o,l in zip(lstm_otp, seq_lens) ])
# feed lstm output tensors
otp = self.lin(fin_lstm_otp)
return otp
```

**Training Method**

```
opt.zero_grad() # Adam optimizer with default parameters
X_batch, lbl_batch, seq_lens = get_batch(batch_size)
output = model.forward() # implementation 2 network
avg_batch_loss = loss(output, lbl_batch) # MSELoss
avg_batch_loss.backward()
opt.step()
```