How can I process stack of frames

Hi, I am building a reinforcement learning environment and have a question about: How can I process stack of frames? I plan to do it with Transformers, but the problem is I don’t know how to summarize frames into one state vector, what I mean is: I am doing frame processing by using pre-trained model, after that i plan to input those features into transformer network so that I get one vector which summarizes sequence of frames

Usually function approximators like CNNs and Transformers take in a stack of frames as input. Therefore, you don’t need to do any pre-processing.

What i meant was: i am featurizing images into vectors so that’s why i want to use resnet, it’s necessary. I am doing same project as CURL (contrastive unsupervised reinforcement learning).
input is: [B, C, F, H, W] batch size, channels, frames, height and width

Hey @dato_nefaridze can you provide a code snippet of some (even buggy) implementation of what you’re trying to do? I’m struggling to understand what it is…