Transformers for image prediction

Hey!

I was wondering whether Transformers can be used for image prediction.
For example: you’ve got a sequence of images and you want to predict what the next image in the sequence will be.

I’ve found plenty of examples for image classification (ViT), but none for prediction. (If you know of any, you’re welcome to share.)

Thanks

This is text-to-video generation using Transformers, so I guess you can find something similar in the literature review.
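
For the setup described in the question (a sequence of images in, the next image out), here is a minimal sketch of how it could be framed with a causal Transformer in PyTorch. It is only an illustration under several assumptions, not a method from a specific paper: the `NextFramePredictor` class, the one-token-per-frame embedding, the tiny 16x16 grayscale frames, the MSE loss, and the random stand-in data are all hypothetical choices made to keep the example short.

```python
import torch
import torch.nn as nn


class NextFramePredictor(nn.Module):
    """Hypothetical toy model: predicts frame t+1 from frames 0..t."""

    def __init__(self, frame_shape=(1, 16, 16), d_model=64,
                 n_heads=4, n_layers=2, max_len=32):
        super().__init__()
        self.frame_shape = frame_shape
        frame_dim = frame_shape[0] * frame_shape[1] * frame_shape[2]
        # Assumption: each whole frame is flattened into a single token.
        self.embed = nn.Linear(frame_dim, d_model)
        # Learned positional embeddings, one per time step.
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Maps each token back to pixel space.
        self.head = nn.Linear(d_model, frame_dim)

    def forward(self, frames):
        # frames: (batch, time, channels, height, width)
        b, t = frames.shape[:2]
        x = self.embed(frames.flatten(2)) + self.pos[:, :t]
        # Causal mask: True marks positions a time step may NOT attend to,
        # so step i only sees frames 0..i.
        mask = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=frames.device),
            diagonal=1,
        )
        h = self.encoder(x, mask=mask)
        return self.head(h).view(b, t, *self.frame_shape)


# One training step on random stand-in data (not a real dataset).
model = NextFramePredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

video = torch.rand(8, 10, 1, 16, 16)            # 8 clips of 10 frames each
inputs, targets = video[:, :-1], video[:, 1:]   # targets are inputs shifted by one

pred = model(inputs)
loss = nn.functional.mse_loss(pred, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The output at the last time step is the prediction for the unseen next frame.
next_frame = pred[:, -1]
```

For real images you would probably want to split each frame into patches (as ViT does) or compress frames with a learned encoder first, since flattening a full-resolution image into one token does not scale well.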