ResNet50 to classify human actions in videos

Hey folks, I am trying to use a ResNet50 to classify human actions, following this repo. I've implemented my own DataLoader to produce batches from this dataset. Unlike the repo, I am not using the 3D CNN but a plain 2D ResNet50, so the network expects a 4D input tensor (batch size, channels, height, width). My loader, however, yields a 5D tensor (batch size, channels, stacked frames, height, width). Should I stop stacking these images and just iterate over a list of 4D tensors instead?
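
To make the shape mismatch concrete, here is a minimal sketch of one alternative to iterating: folding the frame axis into the batch axis so a 2D CNN sees an ordinary 4D batch. The helper names (`fold_frames_into_batch`, `classify_clips`) and the tiny stand-in network are hypothetical, used only so the snippet runs without downloading weights; a 2D ResNet50 would take the stand-in's place. Averaging the per-frame logits into one prediction per clip is an assumption, not something taken from the repo.

```python
import torch
import torch.nn as nn


def fold_frames_into_batch(clips: torch.Tensor) -> torch.Tensor:
    """Reshape (B, C, T, H, W) clips into (B*T, C, H, W) frames."""
    b, c, t, h, w = clips.shape
    # Move the frame axis next to the batch axis, then merge the two.
    return clips.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)


def classify_clips(model: nn.Module, clips: torch.Tensor) -> torch.Tensor:
    """Run a 2D CNN on every frame and average the logits per clip."""
    b, _, t, _, _ = clips.shape
    frame_logits = model(fold_frames_into_batch(clips))  # (B*T, num_classes)
    # Split the merged axis back into (B, T) and average over the frames.
    return frame_logits.reshape(b, t, -1).mean(dim=1)    # (B, num_classes)


# Hypothetical stand-in for a 2D ResNet50: any module mapping
# (N, 3, H, W) -> (N, num_classes) fits here.
toy_model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 5),
)

clips = torch.randn(2, 3, 4, 32, 32)  # (B=2, C=3, T=4, H=32, W=32)
frames = fold_frames_into_batch(clips)
clip_logits = classify_clips(toy_model, clips)
print(frames.shape, clip_logits.shape)
```

Under these assumptions the loader can keep stacking frames; the reshape happens just before the forward pass. Note that this treats each frame independently, so any temporal ordering within the clip is ignored by the network.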