Dataloader: Batch then shuffle

I want to change the order of shuffle and batch. Normally, when using the DataLoader, the data is shuffled first and then the shuffled data is batched:

import torch
from torch.utils.data import DataLoader

x = DataLoader(torch.arange(10), batch_size=2, shuffle=True)

for batch in x:
    print("batch", batch)

batch tensor([7, 9])
batch tensor([4, 2])
batch tensor([5, 3])
batch tensor([0, 8])
batch tensor([6, 1])

What I want is to first batch and then shuffle the batches. One example output would be:
batch tensor([6, 7])
batch tensor([0, 1])
batch tensor([2, 3])
batch tensor([8, 9])
batch tensor([4, 5])

Based on your description it seems you would like to return shuffled pairs of data.
If that’s the case, I think the easiest way would be to return the pairs in Dataset.__getitem__ and reduce the length of the Dataset by 2x.
Let me know if this would work for you.
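For example, a minimal sketch of that idea (the PairDataset name is just for illustration):

import torch
from torch.utils.data import Dataset, DataLoader

class PairDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        # half as many items as underlying samples
        return len(self.data) // 2

    def __getitem__(self, index):
        # each item is a fixed slice of two consecutive samples
        return self.data[2 * index : 2 * index + 2]

# batch_size=None disables automatic batching, so each pair is yielded as-is
loader = DataLoader(PairDataset(torch.arange(10)), batch_size=None, shuffle=True)
for batch in loader:
    print("batch", batch)  # e.g. batch tensor([6, 7])

With this setup, shuffle=True permutes the fixed pairs rather than the individual samples.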

Thanks for your suggestion. I guess what I did is similar to what you suggested.
Let me explain what I want to do in more detail.
I have CIFAR10 and ordered all samples by a hardness measure (curriculum learning). Then I want to load the dataset in chunks of 100 consecutive samples, keeping the order within each chunk.
This is what I did:

  1. reshape the dataset from (60000, 32, 32, 3) to (600, 100, 32, 32, 3)
  2. write a custom Dataset
  3. load one chunk of 100 at a time by setting the loader's batch size to 1.
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, data, targets, transform=None):
        self.data = data  # (num_chunks, 100, 32, 32, 3)
        self.targets = torch.LongTensor(targets)  # (num_chunks, 100)
        self.transform = transform

    def __getitem__(self, index):
        # each item is a whole chunk of 100 consecutive samples
        x = self.data[index]
        y = self.targets[index]

        if self.transform:
            x = np.zeros((100, 3, 32, 32))
            for k in range(self.data[index].shape[0]):
                # rescale to uint8 so ToPILImage accepts the image
                x[k] = self.transform((255.0 * self.data[index][k]).astype(np.uint8))
        return x, y

    def __len__(self):
        return len(self.data)

from torchvision import transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[255.0 * 0.229, 255.0 * 0.224, 255.0 * 0.225])

transform = transforms.Compose([transforms.ToPILImage(),
                                transforms.RandomHorizontalFlip(),
                                transforms.RandomCrop(32, 4),
                                transforms.ToTensor(),
                                normalize])

features_list = []
labels_list = []
# first 50000 samples (500 chunks of 100) for training
for i in range(500):
    a = [features[i * 100:((i + 1) * 100)]]
    features_list.append(a[0])
    b = [labels[i * 100:((i + 1) * 100)]]
    labels_list.append(b[0])

dataset = MyDataset(np.asarray(features_list), np.asarray(labels_list), transform=transform)
train_loader = DataLoader(dataset, batch_size=1, shuffle=True)
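Since each Dataset item is already a chunk of 100, every batch from this loader carries an extra leading dimension of size 1, so the training loop would squeeze it out, roughly like this:

for x, y in train_loader:
    # batch_size=1 wraps each chunk: x is (1, 100, 3, 32, 32), y is (1, 100)
    x = x.squeeze(0).float()  # -> (100, 3, 32, 32)
    y = y.squeeze(0)          # -> (100,)
    # ...forward/backward pass on this chunk of 100 ordered samples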

features_list = []
labels_list = []
# last 10000 samples (100 chunks of 100) for validation
for i in range(100):
    a = [features[50000 + (i * 100):50000 + ((i + 1) * 100)]]
    features_list.append(a[0])
    b = [labels[50000 + (i * 100):50000 + ((i + 1) * 100)]]
    labels_list.append(b[0])

dataset = MyDataset(np.asarray(features_list), np.asarray(labels_list), transform=transform)
val_loader = DataLoader(dataset, batch_size=1, shuffle=True)

What do you think of this?

The approach looks fine to me.
A minor suggestion: you probably don’t need the list creation and [0] indexing here:

a = [features[50000 + (i * 100):50000 + ((i + 1) * 100)]]
features_list.append(a[0])
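i.e. you could append the slice directly:

features_list.append(features[50000 + (i * 100):50000 + ((i + 1) * 100)])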