Is this your implementation? It looks a bit odd to me, particularly since
out, hidden = self.lstm(x.unsqueeze(0))
is called within the loop, seemingly for a single frame (instead of a sequence of frames).
I can’t be sure, however, since I don’t know the shape and nature of x_3d. Right now my gut says the code is off :). In general, though, CNN+LSTM is a common architecture.
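
For comparison, here is a minimal sketch of how I would usually expect a CNN+LSTM forward pass to look. It rests on a guess: that x_3d has shape (batch, time, channels, height, width). The class name, the tiny CNN, and all layer sizes are made up for illustration and are not taken from your code. The point is only that the CNN runs on every frame, and the LSTM is then called once on the whole sequence of frame features rather than once per frame inside the loop.

```python
import torch
import torch.nn as nn


class CNNLSTM(nn.Module):
    """Hypothetical CNN+LSTM; all sizes are illustrative assumptions."""

    def __init__(self, feat_dim=16, hidden_dim=32, num_classes=5):
        super().__init__()
        # Tiny stand-in CNN that maps one frame to a feature vector.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, feat_dim),
        )
        # batch_first=True: the LSTM expects (batch, time, features).
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x_3d):
        # Assumed input shape: (batch, time, channels, height, width).
        b, t, c, h, w = x_3d.shape
        # Fold time into the batch dimension so the CNN sees every frame.
        feats = self.cnn(x_3d.reshape(b * t, c, h, w))  # (b*t, feat_dim)
        feats = feats.reshape(b, t, -1)                 # (b, t, feat_dim)
        # One LSTM call over the full sequence of frame features.
        out, _ = self.lstm(feats)                       # (b, t, hidden_dim)
        # Classify from the output at the last time step.
        return self.fc(out[:, -1])


model = CNNLSTM()
x_3d = torch.randn(2, 4, 3, 16, 16)  # 2 clips, 4 frames each, 16x16 RGB
logits = model(x_3d)
print(logits.shape)  # torch.Size([2, 5])
```

If you do want to keep the per-frame loop, the hidden state returned by one self.lstm call would have to be passed into the next call; otherwise each frame starts from a zero state and the LSTM never sees the previous frames.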