Confused About Batch LSTM

Hi,

I was initially doing single-batch sequence classification, where I pass in multiple variable-length sequences.

For example: if a particular data sequence has length 10 with 5000-dim features, it is 10x1x5000. I then use a label for each element of the sequence, i.e. labels of size 10 (one label per timestep), and as per the LSTM tutorial I model the forward() function as follows -

def __init__(self, hidden_dim, layers, embed_size, num_labels):
    ...
    # (the elided lines store hidden_dim, layers, embed_size, num_labels,
    #  plus batch_size and window_length, as attributes on self)
    self.lstm = nn.LSTM(input_size=self.embed_size, hidden_size=self.hidden_dim, num_layers=self.layers)
    self.cl = nn.Linear(self.hidden_dim, self.num_labels)
    self.hid = self.init_hidden()

def init_hidden(self):
    # zero initial hidden and cell states: (num_layers, batch, hidden_dim)
    hidden_1 = Variable(torch.zeros(self.layers, self.batch_size, self.hidden_dim))
    return (hidden_1, hidden_1)

def forward(self, x):
    # x reshaped to (seq_len, batch, embed_size) before going through the LSTM
    out, self.hid = self.lstm(x.view(self.window_length, self.batch_size, self.embed_size), self.hid)
    # out: (seq_len, batch=1, hidden_dim) -> flattened to (seq_len, hidden_dim) for the classifier
    return F.log_softmax(self.cl(out.view(len(x), -1)), dim=1)
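
Concretely, the shapes work out like this for the single-batch case (the numbers below are made up just to show the flow, and model stands for an instance of the module above):

# rough shape walk-through for the single-batch case
x = Variable(torch.randn(10, 1, 5000))            # one clip: 10 timesteps, batch of 1, 5000-dim features
scores = model(x)                                 # forward() above -> (10, num_labels)
labels = Variable(torch.LongTensor(10).fill_(0))  # one (made-up) label per timestep
loss = F.nll_loss(scores, labels)                 # pairs with the log_softmax output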

But now I want to try batches: instead of taking the entire sequence length, I limit it to 5 and take a batch size of 2, so my input is now fixed to 5x2x5000. How do I modify the forward function? I saw in other examples that they only do self.cl(out[-1]), but here I want the LSTM to see the entire sequence length per batch and learn from it.

I guess my labels would be of size 5x2 in this case? I am not sure. Please help.


I was able to fix this by changing the size of nn.Linear in self.cl to

self.cl = nn.Linear(self.batch_size*self.hidden_dim,self.num_labels)

and in forward()

out,self.hid = self.lstm(x.view(self.window_length,self.batch_size,self.embed_size),self.hid)
return F.log_softmax(self.cl(out.view(len(x),-1)),dim=1)

I seem to be getting pretty good results with this arrangement, but I want to know: is this the correct approach? And if so, why?

The LSTM output (out) has shape (seq_len, batch, hidden_size), cf. https://discuss.pytorch.org/t/understanding-output-of-lstm/12320/2

The linear layer expects inputs of shape (batch, any, number, of, extra, dims, features). So I think your linear layer is using the timesteps as though they were batches and the batches as though they were timesteps. That said, for the code snippets you posted to actually work, the linear layer must be automatically combining some dimensions of its input, and I didn't know it could do that.

To correct this you need to declare the linear layer to take inputs with self.hidden_dim*seq_len features.

# in __init__()
self.cl = nn.Linear(self.hidden_dim*seq_len, self.num_labels)

Then you need to permute the dimensions of out to bring the batch dimension into first place, and finally make the tensor contiguous and flatten the time and feature dimensions, otherwise the linear layer will output too many dimensions.

# in forward()
input_to_cl = out.permute(1, 0, 2).contiguous().view(-1, self.hidden_dim*len(x))
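
Putting the two pieces together, forward() would look something like this (just a sketch, assuming seq_len and batch_size are fixed as in your setup):

# sketch of forward() with the corrected linear layer
def forward(self, x):
    # x: (seq_len, batch, embed_size)
    out, self.hid = self.lstm(x, self.hid)        # out: (seq_len, batch, hidden_dim)
    out = out.permute(1, 0, 2).contiguous()       # -> (batch, seq_len, hidden_dim)
    out = out.view(out.size(0), -1)               # -> (batch, seq_len*hidden_dim)
    return F.log_softmax(self.cl(out), dim=1)     # -> (batch, num_labels)

Note that this gives one prediction per sequence in the batch, so the labels would be of size (batch,) rather than (seq_len, batch).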

Hmm, interesting. But can you explain the good results, though? Even though it perceived my inputs in the opposite manner? Perhaps the order of the frames in my video does not matter.

Without seeing the full code I can only emit confused guesses.

Sure!

I used the configuration for forward() from my second post.

For training I simply do

label = Variable(torch.LongTensor(input_embeddings.size(0)).fill_(label_num))
# label_num is the label shared by the 3 sequences in the batch;
# since the 3 sequences loaded at a time all belong to the same label,
# maybe it doesn't matter - possibly the reason it gives me good results.

out = model_lstm(input_embeddings)
# input_embeddings has size (5, 3, 1024), where 5 is the sequence length and 3 is the batch size
loss = criterion(out.view(out.size(0), -1), label)

loss.backward()
op.step()

To calculate the validation accuracy in the train function, I just do

_, predicted_label = out.max(dim=1)
for element in predicted_label.cpu().data.numpy().astype(int):
    conf_matrix[label.cpu().data[0], element] += 1
    # conf_matrix is a confusion matrix I build at the beginning.
    # Note that I only use the 0th position of the true label,
    # since the sequences loaded at a time all belong to the same label.

I am guessing that the good results, despite moving from num_batches = 1 to num_batches > 1, are because of the way I load the data. It may seem that the batch and sequence dimensions are commutative of sorts, since at any given time all the frames in the embedding belong to the same label.

What I mean is
1 2 3 4 5
6 7 8 9 A
B C D E F

Although I have organized the 1 2 3 4 5 part as one sequence at batch dim = 0, the way I fed data into the linear layer might be looking at 1 6 B as one batch instead. But since 1, 6 and B have the same label and similar appearance, the results are still "good".

Having 1 6 B as a batch is the correct and normal way of doing things, and normally your model will keep the calculations for each element of the batch separated.

But as far as I can see, your model does mix up the dimensions and ends up merging the elements of the batch together.
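
You can see the mix-up on a tiny example (sizes made up: seq_len=2, batch=3, hidden_dim=4):

# out: (seq_len=2, batch=3, hidden=4), values 0..23 just to track where they end up
out = torch.arange(0, 24).view(2, 3, 4)

# what your forward() does: each row is one timestep, with the hidden states
# of all three batch elements concatenated together
mixed = out.view(2, -1)                                     # (seq_len, batch*hidden)

# keeping batch elements separate: each row is one batch element, with its own
# hidden states across the timesteps concatenated together
separated = out.permute(1, 0, 2).contiguous().view(3, -1)   # (batch, seq_len*hidden)

In the first version every row mixes the 1, 6, B frames together, which is why the linear layer ends up merging the batch elements.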

Adding to the above code: how would this generalize to variable sequence lengths? I.e. if the batch size were 1 and the sequence length varies, as in the PyTorch tutorial http://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#sphx-glr-beginner-nlp-sequence-models-tutorial-py

Here the sequence lengths of the training data are 5 and 4 for the two sentences.

You would have to use pack_padded_sequence, but I am not familiar with its use.
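
From the docs, the rough pattern seems to be something like this (untested sketch; seq_a and seq_b are placeholder tensors of sizes (5, 9) and (4, 9)):

from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# zero-pad the two sequences into one (max_seq_len, batch, features) tensor
padded = torch.zeros(5, 2, 9)
padded[:5, 0] = seq_a                     # length-5 sequence
padded[:4, 1] = seq_b                     # length-4 sequence
lengths = [5, 4]                          # true lengths, sorted longest first

packed = pack_padded_sequence(Variable(padded), lengths)    # lets the RNN skip the padded steps
packed_out, hid = lstm(packed, hid)                         # lstm/hid as in the earlier snippets
out, out_lengths = pad_packed_sequence(packed_out)          # out: (5, 2, hidden_dim), zeros past each length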

I believe pack_padded_sequence forces all sequence lengths to be fixed to the longest sequence length by zero padding. But in the tutorial, variable sequence lengths are passed without using pack_padded_sequence, i.e. by simply passing in, say, data_point1 of size (5,1,9) and data_point2 of size (4,1,9). When you were describing your explanation above, what label size were you using? Was it of size 1 per batch? Because I was using size sequence_length per batch.

So when I did "my method", while calculating the loss I did
criterion(out.view(out.size(0), -1), label)
where out is of size (sequence_length, number_of_labels) and label is of size (sequence_length,).

I am just confused: why does the tutorial give a label to each word rather than a single label after a fixed sequence length of words? Does it utilize any kind of temporal information, or is it treating everything as having sequence length = 1?