A question about how DataLoader produces data

I have a dataset of multi-modal images with different sizes. The corresponding images across modalities must receive the same random transformation, so they need to be produced together.
Most of the examples I found use torch.utils.data.DataLoader to produce (images, labels). How can I produce training data of the form ((img1, img2, …), label)? Here img1, img2, and img3 have different sizes and belong to the same category.
Any advice would be appreciated.
Thank you!

I think your question is similar to this one.
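
In case it helps, here is a minimal sketch of one way to do this, not necessarily the only or best one. The idea is that `__getitem__` can return any nested structure, so it can return `((img1, img2, img3), label)`, and the random transform parameters are drawn once per sample and applied to every modality. The names `MultiModalDataset`, `collate_keep_tuples`, and `make_toy_samples` are made up for illustration, the toy data is random tensors, and a horizontal flip stands in for whatever transform you actually use. A custom `collate_fn` is used because the default one tries to stack tensors, which can fail when image sizes differ.

```python
import random

import torch
from torch.utils.data import DataLoader, Dataset


class MultiModalDataset(Dataset):
    """Hypothetical dataset yielding ((img1, img2, ...), label) per sample."""

    def __init__(self, samples):
        # samples: list of (tuple_of_CxHxW_tensors, int_label)
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        images, label = self.samples[idx]
        # Draw the random decision ONCE, then apply it to every modality,
        # so all images of this sample get the same transformation.
        if random.random() < 0.5:
            images = tuple(torch.flip(img, dims=[-1]) for img in images)
        return tuple(images), label


def collate_keep_tuples(batch):
    # Keep images as a list of tuples instead of stacking them,
    # since tensors of different sizes cannot be stacked.
    images = [item[0] for item in batch]
    labels = torch.tensor([item[1] for item in batch])
    return images, labels


def make_toy_samples(n=4):
    # Assumed toy data: three modalities with different spatial sizes.
    sizes = [(32, 32), (48, 64), (20, 24)]
    return [
        (tuple(torch.rand(3, h, w) for h, w in sizes), i % 2)
        for i in range(n)
    ]


loader = DataLoader(
    MultiModalDataset(make_toy_samples()),
    batch_size=2,
    collate_fn=collate_keep_tuples,
)
images, labels = next(iter(loader))
```

Here `images` is a list with one tuple per sample, so `images[0][1]` is the second modality of the first sample. If each modality has a fixed size across samples, you could instead stack per modality inside the collate function.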