Images as LSTM Input

Hi,

I want to feed 18 images of size (3,128,128) into an LSTM of 17 layers. I’m a bit confused about what my input should be. The docs mention that the input should be of shape (seq_len, batch_size, input_size). When I draw my first batch using a data loader I get a tensor of size (18,3,128,128). Does this mean that my LSTM input is seq_len=18, batch_size=1, input_size=3*128*128? Will this flatten the image to a 3*128*128 vector, or do I have to reshape it manually?
I also want to implement Teacher Forcing so I will be modifying the RNN class.
What should forward look like? Here’s what I’m trying but I can’t figure out how to write it.

class trialLSTM(nn.Module):
    def __init__(self, seq_len, input_size, hidden_size, batch_size, num_layers):
        super(trialLSTM, self).__init__()

        self.seq_len = seq_len
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.batch_size = batch_size
        self.lstm = nn.LSTM(seq_len, batch_size, input_size)

    def init_hidden(self):
        # initialize the hidden state and the cell state to zeros
        hidden = torch.zeros(self.batch_size, self.hidden_size)
        cell = torch.zeros(self.batch_size, self.hidden_size)
        if gpu:
            hidden = hidden.cuda()
            cell = cell.cuda()
        return hidden, cell

    def forward(self, x, (h_0, c_0)):
        # Incoming x is (18,3,128,128)
        # do I need to reshape it to (1, 3, 128, 128)? like so:
        # for i in range(0, 18):
        #    x[i] = x[i].reshape(1, 3, 128, 128)
        # if yes, do I reshape it here or in the training loop?
        output = torch.empty(seq_len-1, seq_len-1, seq_len-1)
        for t in range(seq_len+1):
            if t==0:
                hidden, cell = self.lstm(x[0], (h_0,c_0))
            else:
                hidden, cell = self.lstm(x[t], (h_1,c_1))

TL;DR
I want to pass data of shape (18,3,128,128) (18 images of shape (3,128,128) at a time) into an LSTM of 17 layers.
at time 0 input = (data[0], (h_0, c_0)) and output = (h_1, c_1)
at time 1 input = (data[1], (h_1, c_1)) and output = (h_2, c_2)

at time 16 input = (data[16], (h_16, c_16)) and output = (h_17, c_17)
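
The recurrence listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a definitive implementation: the frames are shrunk to 3x8x8 and the LSTM to 2 layers so the sketch runs quickly (with hidden_size=3*128*128 the weight matrices would be enormous), and random data stands in for the data loader. Note that nn.LSTM expects its hidden and cell states in shape (num_layers, batch, hidden_size).

```python
import torch
import torch.nn as nn

# Assumption: tiny 3x8x8 frames and 2 layers so the sketch runs quickly;
# the question uses 3x128x128 frames and 17 layers.
seq_len, batch_size, num_layers = 18, 1, 2
input_size = hidden_size = 3 * 8 * 8

lstm = nn.LSTM(input_size, hidden_size, num_layers)
data = torch.rand(seq_len, 3, 8, 8)  # stand-in for one batch from the loader

# nn.LSTM expects states of shape (num_layers, batch, hidden_size).
h = torch.zeros(num_layers, batch_size, hidden_size)
c = torch.zeros(num_layers, batch_size, hidden_size)

outputs = []
for t in range(seq_len - 1):
    # Teacher forcing: feed the true frame data[t], not the previous prediction.
    x_t = data[t].reshape(1, batch_size, input_size)  # a single time step
    out, (h, c) = lstm(x_t, (h, c))
    outputs.append(out)

# Predictions for frames 1..17, and the matching ground-truth frames.
outputs = torch.cat(outputs, dim=0)
targets = data[1:].reshape(seq_len - 1, batch_size, input_size)
```

With teacher forcing every input is a ground-truth frame, so the same result could also be obtained by passing the whole (17, 1, input_size) sequence to the LSTM in one call; the explicit loop only becomes necessary when a prediction is fed back as the next input.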

so in my training loop:

model = trialLSTM(seq_len=18, input_size=(3*128*128), hidden_size=(3*128*128),
batch_size=1, num_layers=17)
def train(epoch):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        target = data[1,:]
        if gpu:
            data, target = data.cuda(), target.cuda()
        output = model(data, (h_0, c_0))
        # Here I should get a tensor of 18 hidden states of shape (3,128,128) each right?
        loss = nn.BCELoss(output, target)
        loss.backward()
        optimizer.step()

What am I doing wrong? What should I be doing? Please help!

You can’t pass an input image of size (3, 128, 128) directly to an LSTM. You should reshape it to (batch, seq, feature). For example, an input image of size (3*128*128) → (1, 128, 3*128) or (1, 3, 128*128). I think you need a CNN to extract features before passing them into the LSTM.
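
For the 18-frame batch in the question, one possible reading of this advice is to flatten each image into a vector and treat the 18 images as the sequence. The sketch below assumes the default nn.LSTM layout (seq_len, batch, input_size); with batch_first=True the layout would be (batch, seq, feature) instead. The names `frames` and `seq` and the value hidden_size=32 are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# One batch as described in the question: 18 images of shape (3, 128, 128).
frames = torch.zeros(18, 3, 128, 128)

# Flatten each image to a vector and add a batch axis of size 1,
# giving the default nn.LSTM layout (seq_len, batch, input_size).
seq = frames.reshape(18, 1, 3 * 128 * 128)

# hidden_size=32 is an arbitrary small value so the sketch runs quickly.
lstm = nn.LSTM(input_size=3 * 128 * 128, hidden_size=32, num_layers=1)
output, (h_n, c_n) = lstm(seq)

print(seq.shape)     # torch.Size([18, 1, 49152])
print(output.shape)  # torch.Size([18, 1, 32])
```

The LSTM does not flatten anything by itself, so the reshape has to happen before the call, either in forward or in the training loop.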

Thanks for your reply. I did try reshaping it to (1, 3, 128, 128) and that worked. However, I was stacking up LSTM cells instead of using a layered LSTM.
Yes, using a CNN for feature extraction would work, but I want to avoid it for a specific reason: I deliberately want to pass the image itself to the LSTM.
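
The difference between stacking LSTM cells by hand and using a layered LSTM, mentioned above, can be sketched like this. It is a minimal illustration with small made-up sizes (not the ones from the thread): nn.LSTM with num_layers stacks the layers inside one module and takes states of shape (num_layers, batch, hidden_size), while nn.LSTMCell handles one layer and one time step, with states of shape (batch, hidden_size).

```python
import torch
import torch.nn as nn

# Assumption: small sizes for a quick run, not the sizes from the thread.
input_size, hidden_size, num_layers, batch_size = 192, 64, 3, 1

# Option 1: a layered LSTM; the layers are stacked inside a single module.
layered = nn.LSTM(input_size, hidden_size, num_layers=num_layers)

# Option 2: hand-stacked cells; the first cell sees the input, each later
# cell sees the hidden state of the cell below it.
cells = nn.ModuleList(
    [nn.LSTMCell(input_size if i == 0 else hidden_size, hidden_size)
     for i in range(num_layers)]
)

x_t = torch.rand(batch_size, input_size)  # one time step, flattened

# nn.LSTMCell states are (batch, hidden_size), one (h, c) pair per cell.
states = [(torch.zeros(batch_size, hidden_size),
           torch.zeros(batch_size, hidden_size)) for _ in cells]

inp = x_t
for i, cell in enumerate(cells):
    h, c = cell(inp, states[i])
    states[i] = (h, c)
    inp = h  # the output of this layer feeds the next one

# The layered module needs an explicit seq_len axis in front.
out, (h_n, c_n) = layered(x_t.unsqueeze(0))
```

The two options have separate, randomly initialized weights here, so their outputs will differ; the sketch is only meant to show the shapes and the wiring.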

Did it work? I mean did you get a good accuracy?