Need help filling in the missing intuition between DataLoader and LSTM cells

My goal is to use the Moving MNIST dataset for frame prediction with custom CLSTM cells. I am trying to understand how a standard LSTM consumes timestep data so that I can mirror that behavior in my own CLSTM cells. Assuming for now that the LSTM cell takes 2D images as input, how are the timesteps of the input managed? For example, suppose the data loader outputs a batch that is a non-shuffled sequence of images:

[B, C, H, W], where B is the number of images (here, the consecutive frames of the sequence).
Normally with a CNN we feed the entire batch to the network without worrying about timesteps. Here, do I need to write another for loop so that each cell gets a different image from the batch above?
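
To make the question concrete, here is a minimal sketch of the loop I have in mind. It is only my guess at the pattern, not a confirmed approach. It assumes the first axis of the batch is time (T frames of one sequence), uses `torch.nn.LSTMCell` on flattened frames as a stand-in for a CLSTM cell, and all sizes (`T`, `H`, `W`, `hidden_size`) are made up for illustration. As far as I understand, it is one cell with shared weights applied at every step, rather than a separate cell per image.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
T, C, H, W = 10, 1, 64, 64   # one sequence of T frames
hidden_size = 128

# Stand-in for one DataLoader batch whose first axis is time, not samples.
frames = torch.randn(T, C, H, W)

# nn.LSTMCell expects 2D input (batch, input_size), so each frame is flattened.
# A convolutional cell would keep the (C, H, W) layout instead.
cell = nn.LSTMCell(input_size=C * H * W, hidden_size=hidden_size)

# Batch of one sequence; hidden and cell states start at zero.
h = torch.zeros(1, hidden_size)
c = torch.zeros(1, hidden_size)

outputs = []
for t in range(T):
    x_t = frames[t].reshape(1, -1)   # frame t as a (1, C*H*W) input
    h, c = cell(x_t, (h, c))         # the same cell is reused at every step
    outputs.append(h)

outputs = torch.stack(outputs)       # (T, 1, hidden_size)
```

If this is right, the explicit loop over `t` is what carries `h` and `c` from one frame to the next, which a plain CNN forward pass over the whole batch would not do.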