Hello everyone, hope you are doing great.
I have a simple question that might seem dumb, but here it goes.
Why do I get a single hidden state for the whole batch when I try to process each timestep separately?
Consider this simple case:
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, num_layers=1, bidirectional=False, dropout=0.3):
        super().__init__()
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.bidirectional = bidirectional
        self.dropout = dropout
        self.lstm = nn.LSTM(input_size=embedding_dim,
                            hidden_size=self.hidden_size,
                            num_layers=self.num_layers,
                            batch_first=True,
                            dropout=self.dropout,
                            bidirectional=self.bidirectional)
        self.embedding = nn.Embedding(self.vocab_size, embedding_dim)
    def forward(self, x, h):
        x = self.embedding(x)
        out = []
        for t in range(x.size(1)):  # iterate over timesteps (batch_first, so time is dim 1)
            xt = x[:, t]
            print(f'{t}')
            outputs, hidden = self.lstm(xt, h)
            out.append((outputs, hidden))
            print(f'{outputs.shape=}')   # prints (batchsize, timestep, hiddensize)
            print(f'{hidden[0].shape=}') # prints (1, hiddensize)
            print(f'{hidden[1].shape=}') # prints (1, hiddensize)
        return out
enc = Encoder(vocab_size=50, embedding_dim=75, hidden_size=100)
xt = torch.randint(0, 50, size=(5, 30))
h = None
enc(xt, None)
Now I’m expecting the hidden state to come out as (batchsize, hiddensize), the same way my outputs come out fine as (batchsize, timestep, hiddensize). But for some reason the hidden state's shape is (1, hiddensize), not (5, hiddensize), which is the batch size. Basically I'm getting a single hidden state for the whole batch at each iteration, yet the outputs look correct!?
Obviously this doesn't happen if I feed the whole input sequence in all at once, but I need to grab each timestep separately for my Bahdanau attention mechanism. I'm not sure why this is happening.
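For reference, here is a minimal sketch (not my real model, just the same sizes as above: batch=5, seq=30, embed=75, hidden=100) of the whole-sequence case I mean, where the hidden state does come back with the batch dimension:

import torch
import torch.nn as nn

embedding = nn.Embedding(50, 75)
lstm = nn.LSTM(input_size=75, hidden_size=100, num_layers=1, batch_first=True)

x = torch.randint(0, 50, size=(5, 30))   # (batch, seq)
emb = embedding(x)                       # (5, 30, 75)
outputs, (h_n, c_n) = lstm(emb)          # feed the full sequence in one call
print(outputs.shape)  # torch.Size([5, 30, 100]) -> (batch, seq, hidden)
print(h_n.shape)      # torch.Size([1, 5, 100])  -> (num_layers, batch, hidden)
print(c_n.shape)      # torch.Size([1, 5, 100])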
Any help is greatly appreciated.