I want to change the order of shuffling and batching. Normally, when using the DataLoader, the data is shuffled first and the shuffled data is then batched:
import torch, torch.nn as nn
from torch.utils.data import DataLoader
x = DataLoader(torch.arange(10), batch_size=2, shuffle=True)
print(list(x))
batch [tensor(7), tensor(9)]
batch [tensor(4), tensor(2)]
batch [tensor(5), tensor(3)]
batch [tensor(0), tensor(8)]
batch [tensor(6), tensor(1)]
What I want is to batch first and then shuffle the batches. One example output would be:
batch [tensor(6), tensor(7)]
batch [tensor(0), tensor(1)]
batch [tensor(2), tensor(3)]
batch [tensor(8), tensor(9)]
batch [tensor(4), tensor(5)]
Based on your description it seems you would like to return shuffled pairs of data.
If that’s the case, I think the easiest way would be to return the pairs in Dataset.__getitem__ and reduce the length of the Dataset by 2x.
Let me know if this would work for you.
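As a rough illustration of the idea, here is a minimal sketch for the toy example from the question. The class name `ChunkDataset` and its `chunk_size` argument are made up for this example and are not part of PyTorch. It assumes the data is an indexable tensor and relies on `batch_size=None`, which turns off automatic batching so that each chunk is returned unchanged.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ChunkDataset(Dataset):
    """Hypothetical helper: each item is one contiguous chunk of the data."""

    def __init__(self, data, chunk_size=2):
        self.data = data
        self.chunk_size = chunk_size

    def __len__(self):
        # The dataset is chunk_size times shorter; a trailing partial chunk is dropped.
        return len(self.data) // self.chunk_size

    def __getitem__(self, index):
        start = index * self.chunk_size
        return self.data[start:start + self.chunk_size]


dataset = ChunkDataset(torch.arange(10), chunk_size=2)
# batch_size=None disables automatic batching, so the loader only shuffles
# the order of the chunks and leaves the order inside each chunk intact.
loader = DataLoader(dataset, batch_size=None, shuffle=True)
batches = [batch.tolist() for batch in loader]
print(batches)
```

The pairs come out in random order, but every pair keeps its original neighbours, e.g. [6, 7] or [0, 1].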
Thanks for your suggestion. I guess what I did is similar to what you suggested.
Let me explain what I want to do in more detail.
I have CIFAR10 and have ordered all samples by a hardness measure (curriculum). I then want to load the dataset in consecutive chunks of 100 samples, keeping the order within each chunk.
This is what I did:
- reshape the dataset from (60000, 32, 32, 3) to (600, 100, 32, 32, 3)
- write a custom Dataset
- load one chunk of 100 samples at a time and set the batch size of the loader to 1
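import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms

class MyDataset(Dataset):
    def __init__(self, data, targets, transform=None):
        self.data = data  # shape: (num_chunks, 100, 32, 32, 3)
        self.targets = torch.LongTensor(targets)  # shape: (num_chunks, 100)
        self.transform = transform

    def __getitem__(self, index):
        # One item is a whole chunk of 100 samples
        x = self.data[index]
        y = self.targets[index]
        if self.transform:
            # Apply the transform to each image in the chunk separately
            x = np.zeros((100, 3, 32, 32))
            for k in range(self.data[index].shape[0]):
                x[k] = self.transform((255.0 * self.data[index][k]).astype(np.uint8))
        return x, y

    def __len__(self):
        return len(self.data)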
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[255.0 * 0.229, 255.0 * 0.224, 255.0 * 0.225])
transform = transforms.Compose([transforms.ToPILImage(),
                                transforms.RandomHorizontalFlip(),
                                transforms.RandomCrop(32, 4),
                                transforms.ToTensor(),
                                normalize])
# features and labels are the full arrays, already sorted by hardness

# Training set: first 50000 samples, split into 500 chunks of 100
features_list = []
labels_list = []
for i in range(500):
    a = [features[i * 100:((i + 1) * 100)]]
    features_list.append(a[0])
    b = [labels[i * 100:((i + 1) * 100)]]
    labels_list.append(b[0])
dataset = MyDataset(np.asarray(features_list), np.asarray(labels_list), transform=transform)
train_loader = DataLoader(dataset, batch_size=1, shuffle=True)

# Validation set: last 10000 samples, split into 100 chunks of 100
features_list = []
labels_list = []
for i in range(100):
    a = [features[50000 + (i * 100):50000 + ((i + 1) * 100)]]
    features_list.append(a[0])
    # Use the same 50000 offset for the labels so they match the features
    b = [labels[50000 + (i * 100):50000 + ((i + 1) * 100)]]
    labels_list.append(b[0])
dataset = MyDataset(np.asarray(features_list), np.asarray(labels_list), transform=transform)
val_loader = DataLoader(dataset, batch_size=1, shuffle=True)
What do you think of this?
The approach looks fine to me.
A minor suggestion: you probably don’t need the list creation and [0] indexing here:
a = [features[50000 + (i * 100):50000 + ((i + 1) * 100)]]
features_list.append(a[0])
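A minimal sketch of the simplification, assuming features is a NumPy array whose length is a multiple of the chunk size. The array here is small stand-in data, not CIFAR10, and chunk_size is a name introduced only for this example. The slice can be appended directly, and a single reshape should give the same chunked array without any loop:

```python
import numpy as np

# Stand-in data (assumption): 300 fake "images" of shape (32, 32, 3)
features = np.arange(300 * 32 * 32 * 3, dtype=np.float32).reshape(300, 32, 32, 3)
chunk_size = 100

# Option 1: append each slice directly, no wrapping list and no [0] indexing
features_list = []
for i in range(len(features) // chunk_size):
    features_list.append(features[i * chunk_size:(i + 1) * chunk_size])
chunks_from_loop = np.asarray(features_list)

# Option 2: one reshape that splits the first axis into (num_chunks, chunk_size)
chunks_from_reshape = features.reshape(-1, chunk_size, *features.shape[1:])

print(chunks_from_loop.shape, chunks_from_reshape.shape)
```

Both options keep the samples in their original order inside each chunk, so the curriculum ordering is preserved.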