Shuffling at the slice level, not the subject level, with DataLoader and Dataset

Junhao_Wen · December 26, 2018, 7:18pm

Hi,

I am writing a class based on Dataset. In __getitem__, I read one MRI and put all of its slices (100 slices per MRI) into a list. The dataset is then passed to the DataLoader:

from torch.utils.data import DataLoader

train_loader = DataLoader(data_train,
                          batch_size=options.batch_size,
                          shuffle=True,
                          num_workers=options.num_workers,
                          drop_last=True,
                          pin_memory=True)

The problem is that if I set shuffle=True, the batches are shuffled at the subject level. For example, if batch_size is 16, each batch gives me 16 different subjects, each with its 100 slices. I do not want this behavior, because the slices are not really shuffled. Do you have any ideas for shuffling at the slice level?

I have tried reading only one slice in __getitem__, but when I train the model, it is super slow…

Any idea would be appreciated…

smth · December 27, 2018, 12:20am

You can write a custom Sampler and shuffle in a more fine-grained way. Use the sampler keyword argument of DataLoader instead of shuffle=True. You can see some of the samplers here: https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler

They are quite simple to implement, so you can write a custom sampler that is aware of how your dataset is sliced.
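As a rough illustration of the idea, here is a minimal sketch of a slice-aware sampler. It is written in plain Python so it runs on its own; a sampler only needs __iter__ and __len__, and in real code you would probably subclass torch.utils.data.Sampler and pass an instance via sampler=... with shuffle left at its default. The names GroupedSliceSampler, subjects_per_group and decode are made up for this example. It assumes every subject has the same number of slices and that the dataset's __getitem__ accepts a flat slice index. Slices are shuffled within small groups of subjects, so a dataset that caches recently loaded MRIs would not need to reread a volume for every slice.

```python
import random


class GroupedSliceSampler:
    """Yields flat slice indices, shuffled within small groups of subjects.

    Hypothetical helper: flat index = subject * slices_per_subject + slice.
    """

    def __init__(self, num_subjects, slices_per_subject,
                 subjects_per_group=4, seed=None):
        self.num_subjects = num_subjects
        self.slices_per_subject = slices_per_subject
        self.subjects_per_group = subjects_per_group
        self.rng = random.Random(seed)

    def __iter__(self):
        # Shuffle the subject order first, so groups differ between epochs.
        subjects = list(range(self.num_subjects))
        self.rng.shuffle(subjects)
        for start in range(0, len(subjects), self.subjects_per_group):
            group = subjects[start:start + self.subjects_per_group]
            # Collect every slice of every subject in this group ...
            indices = [s * self.slices_per_subject + k
                       for s in group
                       for k in range(self.slices_per_subject)]
            # ... and mix them, so a batch contains slices of several subjects.
            self.rng.shuffle(indices)
            yield from indices

    def __len__(self):
        # Length is counted in slices, not in subjects.
        return self.num_subjects * self.slices_per_subject


def decode(index, slices_per_subject):
    """Map a flat index back to (subject_index, slice_index)."""
    return divmod(index, slices_per_subject)
```

With this layout the dataset's __len__ would also count slices, and its __getitem__ would call decode and return a single slice. A larger subjects_per_group gives better mixing at the cost of keeping more volumes in memory.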

Junhao_Wen · December 27, 2018, 10:51am

@smth Actually, I have thought about this solution, but the problem is that the __len__ of the DataLoader and the __len__ of the sampler that I created were not equal. I do not see how to handle that situation with the sampler.

I also tried extracting only one slice from the whole MRI in __getitem__. The problem with this approach is that memory usage explodes at some point during training…

smth · December 29, 2018, 7:02am

Hmm, how about you set batch_size=1 in the DataLoader, but have your custom dataset return a full batch every time __getitem__ is called? That way you can control shuffling and other aspects yourself in the Dataset.
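One way this could look is sketched below, again in plain Python so it runs without any data. The class name SliceBatchDataset, the load_subject callable and the parameters subjects_per_batch and batches_per_epoch are all assumptions made for the example, not part of PyTorch. Each call to __getitem__ reads a few subjects and draws some random slices from each, so one batch mixes several subjects while only a handful of MRIs are read per batch. Note that this draws slices at random, so it does not guarantee that every slice is seen exactly once per epoch.

```python
import random


class SliceBatchDataset:
    """Each item is a whole batch of slices mixed from several subjects."""

    def __init__(self, load_subject, num_subjects, batch_size,
                 subjects_per_batch, batches_per_epoch, rng=random):
        if batch_size % subjects_per_batch != 0:
            raise ValueError("batch_size must be divisible by subjects_per_batch")
        if subjects_per_batch > num_subjects:
            raise ValueError("subjects_per_batch exceeds num_subjects")
        # Hypothetical loader: subject index -> sequence of slices.
        self.load_subject = load_subject
        self.num_subjects = num_subjects
        self.subjects_per_batch = subjects_per_batch
        self.slices_per_pick = batch_size // subjects_per_batch
        self.batches_per_epoch = batches_per_epoch
        self.rng = rng  # the random module, or a random.Random instance

    def __len__(self):
        # Length is counted in batches, because one item is one batch.
        return self.batches_per_epoch

    def __getitem__(self, index):
        # The index only counts batches; the content is drawn at random.
        subject_ids = self.rng.sample(range(self.num_subjects),
                                      self.subjects_per_batch)
        batch = []
        for sid in subject_ids:
            volume = self.load_subject(sid)  # one read per chosen subject
            picks = self.rng.sample(range(len(volume)), self.slices_per_pick)
            batch.extend((sid, k, volume[k]) for k in picks)
        self.rng.shuffle(batch)  # interleave the subjects within the batch
        return batch
```

In a real training setup __getitem__ would more likely stack the slices into one tensor and return it together with the labels. With batch_size=1 the default collation adds a leading dimension of size 1 that has to be squeezed away; recent PyTorch versions also allow batch_size=None, which disables automatic batching and hands the item through as is.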

gkrisp9 · January 22, 2022, 5:26pm

Hi, I am facing a similar problem at the moment. Have you solved it?