Training a CNN-RNN network for multi-label video classification with a sliding-window technique

I’m implementing a model in which a CNN extracts feature sequences from videos, and an RNN analyzes the generated feature sequences and outputs a multi-label classification result. Frame-wise annotation of the actions is not available, only their temporal order, so the model is trained with nn.CTCLoss.
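To make the setup concrete, here is a minimal sketch of what I mean (the backbone, layer sizes, and label sequence are just placeholders, not my real model):

```python
import torch
import torch.nn as nn

# Placeholder per-frame CNN: each frame -> a 16-dim feature vector.
num_classes = 5  # including the CTC blank at index 0
cnn = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (N, 16)
)
rnn = nn.GRU(input_size=16, hidden_size=32, batch_first=True)
head = nn.Linear(32, num_classes)
ctc = nn.CTCLoss(blank=0)

video = torch.randn(1, 20, 3, 64, 64)            # (batch, time, C, H, W)
B, T = video.shape[:2]
feats = cnn(video.flatten(0, 1)).view(B, T, -1)  # per-frame features (B, T, 16)
out, _ = rnn(feats)
log_probs = head(out).log_softmax(-1).transpose(0, 1)  # (T, B, num_classes)

# Only the temporal order of the actions is known, no timestamps.
targets = torch.tensor([[1, 3, 2]])
loss = ctc(log_probs, targets,
           input_lengths=torch.full((B,), T),
           target_lengths=torch.tensor([3]))
loss.backward()
```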
I want to use a sliding-window technique to generate more video segments (overlapping windows) and capture dependencies across different timesteps.
Should I apply x.unfold() to the input video, or to the features generated by the CNN before they are fed to the RNN?
Does it have any effect on the backward pass?
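For reference, this is the kind of unfolding I have in mind, applied to the CNN features (window size, stride, and dimensions are arbitrary). Since Tensor.unfold returns a view, overlapping windows share the same source frames, and during backward the gradients from all windows accumulate into those shared frames:

```python
import torch

# Hypothetical feature tensor produced by the CNN: (batch, time, feat_dim).
feats = torch.randn(2, 20, 16, requires_grad=True)

# Sliding windows over time: window size 8, stride 4 -> 4 overlapping windows.
windows = feats.unfold(dimension=1, size=8, step=4)       # (2, 4, 16, 8)
windows = windows.permute(0, 1, 3, 2).reshape(-1, 8, 16)  # (8, 8, 16): windows as a batch

# Dummy loss: gradients from overlapping windows sum into the shared frames.
windows.sum().backward()
print(feats.grad[0, :, 0])  # frames covered by two windows get 2.0, edge frames 1.0
```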