Correctly feeding LSTM with minibatch time sequence data

enumerate will return the index as the first value in the loop.
This code should work:

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class OUDataset(Dataset):
  def __init__(self):
    self.oudataframe = pd.DataFrame(np.random.randn(1000, 2))

  def __len__(self):
    # each index starts a window of 100 rows, so the last valid start is len - 100
    return len(self.oudataframe) - 100

  def __getitem__(self, idx):
    data = torch.zeros(100, 1)
    target = torch.zeros(100, 1)
    for i in range(0, 100):
      data[i] = self.oudataframe.iloc[idx + i, 0]
      target[i] = self.oudataframe.iloc[idx + i, 1]
    return data, target

dataSet = OUDataset()
dataloader = DataLoader(dataSet, batch_size=20, shuffle=False)
for idx, (data, target) in enumerate(dataloader):
    print('BatchIdx {}, data.shape {}, target.shape {}'.format(
            idx, data.shape, target.shape))

By my calculations, if there are 400,000 samples, with each mini-batch being 20 sequences of length 100, processing the entire data set should take 200 batches (400,000 / (20 * 100)). However, when I enumerate over the dataloader, that doesn’t seem to be the case. What am I missing here?
len(dataloader) returns 19,995!

For each passed index, you will get the window of the following samples, which results in the overlapping-windows approach mentioned earlier: index 0 returns rows 0-99, index 1 returns rows 1-100, and so on. With 400,000 rows that gives 400,000 - 100 = 399,900 windows, and 399,900 / 20 = 19,995 batches, which is exactly what len(dataloader) reports.
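
A quick way to see the overlap (a minimal check, reusing the 1,000-row OUDataset from the first post):

dataSet = OUDataset()                # 1,000 rows -> 900 overlapping windows
print(len(dataSet))                  # 900
print(dataSet[0][0][:3].squeeze())   # rows 0, 1, 2 of column 0
print(dataSet[1][0][:3].squeeze())   # rows 1, 2, 3 of column 0 -- shifted by one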

If you don’t want that, you could multiply the index by the window size in __getitem__ and divide the length of your dataset by the window size in __len__:

class OUDataset(Dataset):
  def __init__(self, window_size=100):
    self.oudataframe = pd.DataFrame(np.random.randn(400000, 2))
    self.window_size = window_size

  def __len__(self):
    # non-overlapping windows: 400,000 rows // 100 = 4,000 samples
    return len(self.oudataframe) // self.window_size

  def __getitem__(self, idx):
    # jump in steps of window_size so consecutive indices don't overlap
    idx = idx * self.window_size
    print('window: {}-{}'.format(idx, idx + self.window_size - 1))
    data = torch.zeros(self.window_size, 1)
    target = torch.zeros(self.window_size, 1)
    for i in range(0, self.window_size):
      data[i] = self.oudataframe.iloc[idx + i, 0]
      target[i] = self.oudataframe.iloc[idx + i, 1]
    return data, target

This worked PERFECTLY! I didn’t understand what you meant earlier about overlapping windows. Looking at the logic, it’s all pretty clear now.

Good to hear it’s working! 🙂

There seems to be some sort of bug with the windows approach: although there are 200 mini-batches,
the last batch from the dataloader is 100 points shy. Its shape is [19, 100] instead of [20, 100]. I just discovered this after a long training session. Any ideas?

The batch index 19 would still get the last part of the dataset, wouldn’t it?
Using 400 samples, a sequence length of 100, and a batch size of 2 seems to work:

class OUDataset(Dataset):
  def __init__(self, window_size=100):
    self.oudataframe = pd.DataFrame(np.random.randn(400, 2))
    self.window_size = window_size

  def __len__(self):
    return len(self.oudataframe) // self.window_size

  def __getitem__(self, idx):
    idx = idx * self.window_size
    print('window: {}-{}'.format(idx, idx + self.window_size - 1))
    data = torch.zeros(self.window_size, 1)
    target = torch.zeros(self.window_size, 1)
    for i in range(0, self.window_size):
      data[i] = self.oudataframe.iloc[idx + i, 0]
      target[i] = self.oudataframe.iloc[idx + i, 1]
    return data, target



dataSet = OUDataset()
dataloader = DataLoader(dataSet, batch_size=2, shuffle=False)
for idx, (data, target) in enumerate(dataloader):
    print('BatchIdx {}, data.shape {}, target.shape {}'.format(
            idx, data.shape, target.shape))

window: 0-99
window: 100-199
BatchIdx 0, data.shape torch.Size([2, 100, 1]), target.shape torch.Size([2, 100, 1])
window: 200-299
window: 300-399
BatchIdx 1, data.shape torch.Size([2, 100, 1]), target.shape torch.Size([2, 100, 1])

Silly off-by-one error in my data generation module. It only had 399,999 points.

So, another question regarding properly calculating the loss while training this model: should I call loss.backward() and optimizer.step() after every mini-batch, or only after each complete pass through the data set?

Edit:

My current code looks like this:

dataloader = DataLoader(dataSet, batch_size=minibatch_size, shuffle=True, num_workers=8) # this will need to be shuffle=True
# self, input_size, output_size, hidden_size, num_layers
model = OUSolverModel(input_size, output_size, hidden_size, num_layers).cuda()
print("Training Model ... ")
loss_list = []
for i in range(0, epochSetting):
  model.train()
  # note: recreating Adam every epoch resets its internal state; adj_learning_rate(i) sets the lr per epoch
  opt = torch.optim.Adam(model.parameters(), lr=adj_learning_rate(i))
  print('Starting epoch: {}/{}'.format(i+1, epochSetting))
  for index, (data, target) in enumerate(dataloader):
    print('Epoch: {}/{} MiniBatch: {}/{}'.format(i+1, epochSetting, index+1, len(dataloader)))
    data = data.to(device)
    target = target.to(device)
    out = model(data)
    loss = nn.MSELoss(reduction='mean')(out, target)
    opt.zero_grad()   # zero the gradients before backward(), not between backward() and step()
    loss.backward()
    opt.step()
    #print('Loss: {}'.format(loss.item()))
    #if(index == 199):
    #  loss_list.append(loss.item())
  loss_list.append(loss.item())
  print('Epoch: {}/{}, loss: {}'.format(i+1, epochSetting, loss.item()))
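
On the "properly calculating loss" part: the loop above only records the loss of the last mini-batch of each epoch. If a per-epoch number is wanted, a running average over all batches is a common alternative (a sketch reusing the names above, not code from the thread):

epoch_loss = 0.0                         # reset at the start of each epoch
for index, (data, target) in enumerate(dataloader):
    data, target = data.to(device), target.to(device)
    out = model(data)
    loss = nn.MSELoss(reduction='mean')(out, target)
    opt.zero_grad()                      # one update per mini-batch: zero, backward, step
    loss.backward()
    opt.step()
    epoch_loss += loss.item()
loss_list.append(epoch_loss / len(dataloader))   # mean mini-batch loss for this epoch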

I am experiencing the same problem: I can’t figure out how to declare the input dimension of the Linear layer used in the forward function.

I want to feed the LSTM with batches of 256 sequences, each of length 20.

My DataLoader provides inputs of shape [256, 20, 1], i.e. [batch_size, len_sequence, num_features], and my labels are tensors of 256 elements.

I know that by default the LSTM expects inputs of shape [len_sequence, batch_size, num_features], but I am specifying batch_first=True, so it accepts the batch-first layout directly and its output has shape [batch_size, len_sequence, hidden_size].
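
As a quick shape check (a self-contained sketch with random tensors, just to confirm the batch_first=True behaviour with the sizes from this post):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=1, hidden_size=64, num_layers=2, batch_first=True)
x = torch.randn(256, 20, 1)     # [batch_size, len_sequence, num_features]
out, (h_n, c_n) = lstm(x)
print(out.shape)                # torch.Size([256, 20, 64]) -> [batch_size, len_sequence, hidden_size]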

How do I have to declare the Linear layer?

For now I am using the following model and training loop, but the predicted value y_pred has a size of [20*256], which is messing up my loss function.

class LSTM(nn.Module):

    def __init__(self, input_dim=1, hidden_dim=64, batch_size=BATCH_SIZE, output_dim=1,
                    num_layers=2, sequence_length=train_window):
        super(LSTM, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.batch_size = batch_size
        self.num_layers = num_layers
        self.sequence_length = sequence_length

        # Define the LSTM layer
        self.lstm = nn.LSTM(self.input_dim, self.hidden_dim, self.num_layers, batch_first=True)

        # Define the output layer
        self.linear = nn.Linear(self.hidden_dim, output_dim)

    def init_hidden(self):
        # This is what we'll initialise our hidden state as
        return (torch.zeros(self.num_layers, self.batch_size, self.hidden_dim),
                torch.zeros(self.num_layers, self.batch_size, self.hidden_dim))

    def forward(self, input):
        # self.hidden is not passed to the LSTM here, so the hidden state always starts from zeros
        lstm_out, self.hidden = self.lstm(input.view(len(input), self.sequence_length, -1))
        # lstm_out: [batch_size, sequence_length, hidden_dim]
        y_pred = self.linear(lstm_out)   # [batch_size, sequence_length, output_dim]
        return y_pred.view(-1)           # flattened to [batch_size * sequence_length] -- the 20*256 mentioned above

model = LSTM()

loss_fn = torch.nn.MSELoss()

optimiser = torch.optim.Adam(model.parameters(), lr=0.001)
####################
# Train model
#####################
num_epochs =500
tot_iterations = round(len(train_data_normalized)/BATCH_SIZE)
hist = np.zeros(num_epochs)

for t in range(1,num_epochs+1):
  iteration = 0
  for seq, labels in Train_set: 
    iteration += 1
    # Initialise hidden state
    # Don't do this if you want your LSTM to be stateful
    model.hidden = model.init_hidden()
    # Forward pass
    y_pred = model(seq)
    print(y_pred.shape,labels)
    loss = loss_fn(y_pred, labels)
    if iteration%25 == 0:
      print(f'epoch: {t:1} iteration {iteration:1}/{tot_iterations:3} loss: {loss.item():10.8f}')
    hist[t-1] = loss.item()  # t starts at 1; hist has num_epochs entries

    # Zero out gradient, else they will accumulate between epochs
    optimiser.zero_grad()

    # Backward pass
    loss.backward()

    # Update parameters
    optimiser.step()

Any suggestion would be appreciated, thanks a lot!

For others who might have issues with RNNs and sequences of different lengths, here is a solution: a DataLoader with a collate function that pads the sequences of each batch to the length of the longest one (pad_sequence appends the padding values at the end of the shorter sequences). This works if your dataset’s __getitem__ method returns a pair (seq, target):

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate_fn_pad(list_pairs_seq_target):
    seqs = [seq for seq, target in list_pairs_seq_target]
    targets = [target for seq, target in list_pairs_seq_target]
    seqs_padded_batched = pad_sequence(seqs)   # pads shorter sequences at the end; shape [max_seq_len, batch_size, *]
    targets_batched = torch.stack(targets)
    assert seqs_padded_batched.shape[1] == len(targets_batched)
    return seqs_padded_batched, targets_batched
    
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate_fn_pad)

for seq, labels in dataloader:
        y_pred = rnn(seq)
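
A quick toy check of what this collate function produces (made-up sequences of lengths 5 and 3 with one feature each):

a = torch.randn(5, 1)
b = torch.randn(3, 1)
seqs, targets = collate_fn_pad([(a, torch.tensor(0.)), (b, torch.tensor(1.))])
print(seqs.shape)      # torch.Size([5, 2, 1]) -> [max_seq_len, batch_size, num_features]
print(targets.shape)   # torch.Size([2])

Note that the padded batch is sequence-first, which matches the default (batch_first=False) input layout of nn.RNN / nn.LSTM.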

Hi there! Sorry I’m late. I think your fully connected layer should be declared as:

        # Define the output layer
        self.linear = nn.Linear(self.hidden_dim*self.sequence_length, output_dim)

Hope it helps!
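
If it helps, here is a sketch of a forward pass matching that layer, assuming the goal is one prediction per sequence (so y_pred lines up with the 256 labels); it reuses the attribute names from the model above:

    def forward(self, input):
        # input: [batch_size, sequence_length, input_dim]
        lstm_out, _ = self.lstm(input)              # [batch_size, sequence_length, hidden_dim]
        flat = lstm_out.reshape(len(input), -1)     # [batch_size, sequence_length * hidden_dim]
        y_pred = self.linear(flat)                  # [batch_size, output_dim]
        return y_pred.view(-1)                      # [batch_size] when output_dim == 1

An alternative that keeps the original nn.Linear(self.hidden_dim, output_dim) is to pass only the last time step, lstm_out[:, -1, :], to the linear layer.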