Hello,
I’m implementing an RNN for video classification, but the accuracy doesn’t improve during training. I use a pretrained ResNet-50 as a feature extractor whose per-frame features feed an LSTM. For context, the model is wired up roughly like this (the layer sizes are illustrative, not my exact hyperparameters):
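import torch
import torch.nn as nn
import torchvision.models as models

class VideoRNN(nn.Module):
    def __init__(self, feature_dim=2048, hidden_dim=512, num_classes=10):
        super().__init__()
        # Pretrained ResNet-50 with the fc head replaced, so it emits
        # a 2048-d feature vector per frame
        self.resnet = models.resnet50(pretrained=True)
        self.resnet.fc = nn.Identity()
        self.lstm = nn.LSTM(feature_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_classes)

    def _init_state(self, b_size):
        # Zero-initialized (h_0, c_0) on the same device as the model
        p = next(self.parameters())
        shape = (self.lstm.num_layers, b_size, self.lstm.hidden_size)
        return (p.new_zeros(shape), p.new_zeros(shape))

In particular I’m worried that I’m messing up the forward step of the network. Here is my code: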
def forward(self, x, seq):
    # x: list of frame tensors, x[i] has shape (seq[i], C, H, W)
    # seq: sequence lengths, already sorted in decreasing order by the collate_fn
    state = self._init_state(b_size=len(seq))

    # Run the ResNet over every frame of every video
    y = []
    for i in range(len(seq)):
        y.append(self.resnet(x[i]))  # (seq[i], feature_dim)

    # Pad to (max_len, batch, feature_dim) and pack for the LSTM
    y = torch.nn.utils.rnn.pad_sequence(y)
    pack = torch.nn.utils.rnn.pack_padded_sequence(y, seq, batch_first=False)
    z, _ = self.lstm(pack, state)
    z, _ = torch.nn.utils.rnn.pad_packed_sequence(z, batch_first=False)

    # Pick the output at the last valid timestep of each sequence
    t = []
    for i in range(len(seq)):
        t.append(z[seq[i] - 1, i, :])
    t = torch.stack(t, 0)

    out = self.classifier(t)
    out = self.out(out)
    return out
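If it matters, my understanding is that indexing the last valid timestep like this should be equivalent to just taking the final hidden state the LSTM returns for packed input, i.e. something like:

# For packed input, h_n already holds each sequence's hidden state
# at its last valid timestep, so this should match t above
_, (h_n, c_n) = self.lstm(pack, state)
t = h_n[-1]  # (batch, hidden_size), from the last LSTM layer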
I also attach the modified collate_fn function of the dataloader:
def my_collate(batch):
    # batch: list of (frames, label) pairs from the dataset
    frames, l = zip(*batch)
    lengths = [f.shape[0] for f in frames]

    # Sort by length in decreasing order, as pack_padded_sequence requires
    perm_idx = sorted(range(len(lengths)), key=lengths.__getitem__, reverse=True)
    frames_out = [frames[i] for i in perm_idx]
    l_out = [l[i] for i in perm_idx]
    lengths_out = [lengths[i] for i in perm_idx]
    return frames_out, torch.LongTensor(l_out), lengths_out
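The dataloader itself is built in the usual way (the batch size is just an example, and dataset stands for my video dataset):

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=8, shuffle=True, collate_fn=my_collate)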
Do you think there is a bug in my implementation? Thanks