Best way to load and organize data for a 2D CNN that takes in 12 images (ECG leads) per instance of a class

I am building a 2D CNN that takes in 12 images, one per ECG lead, for each training example. The network will essentially consist of 12 individual subnetworks whose outputs are concatenated into one output. My dataset consists of patients who are either positive or negative for heart arrhythmia. The folder structure looks like this:

-Images
    -negative
        -patient0
            0.jpg
            ...
            12.jpg
        -patient1
            0.jpg
            ...
            12.jpg
        ...
    -positive
        -patient150
        ...

When I use datasets.ImageFolder, I get one long list of individual images instead of images grouped into sets of 12, one set per patient. As a result, I cannot apply K-fold cross-validation or shuffle the dataset, because the 12 lead images that belong to one patient's ECG are no longer kept together.
So I'm wondering what a smart approach would be.
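
To make the goal concrete, here is a rough sketch of the grouping I have in mind: treat each patient folder as one sample, so that shuffling and fold splitting happen over patients rather than over individual images. It assumes the lead files have purely numeric names and that the two class folders are named exactly as above. The names index_patients, PatientLeads, patient_folds and the load_image callable are made up for illustration and are not part of torchvision.

```python
import random
from pathlib import Path


def index_patients(root):
    """Return one (image_paths, label) sample per patient folder."""
    samples = []
    # Assumption: label 0 = negative, label 1 = positive.
    for label, class_name in enumerate(["negative", "positive"]):
        class_dir = Path(root) / class_name
        for patient_dir in sorted(p for p in class_dir.iterdir() if p.is_dir()):
            # Sort by the numeric file stem so the lead order is the same
            # for every patient (a plain string sort would put 10 before 2).
            leads = sorted(patient_dir.glob("*.jpg"), key=lambda p: int(p.stem))
            samples.append((leads, label))
    return samples


class PatientLeads:
    """Map-style dataset: one item holds all lead images of one patient."""

    def __init__(self, samples, load_image):
        self.samples = samples
        # Hypothetical loader supplied by the caller, for example a function
        # that opens the file and applies the image transforms.
        self.load_image = load_image

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        paths, label = self.samples[idx]
        return [self.load_image(p) for p in paths], label


def patient_folds(n_patients, k, seed=0):
    """Shuffle patient indices and split them into k disjoint folds."""
    order = list(range(n_patients))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]
```

Since PatientLeads implements __len__ and __getitem__, it could presumably be passed to torch.utils.data.DataLoader directly, or be made a subclass of torch.utils.data.Dataset. Each fold from patient_folds is a list of patient indices, which should fit torch.utils.data.Subset. If load_image returns tensors, the list for one patient could be stacked with torch.stack before being split across the 12 subnetworks. Note that patient_folds does not balance positive and negative patients across folds.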