Custom dataset based on CIFAR10

Hello everyone.
I want to create a dataset based on CIFAR10, then create a DataLoader and train my model on it.
I have a function that adds noise to the CIFAR10 images, say:

def create_noise(model, image):
    .....
    return noisy_image
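(The real create_noise is not shown; purely for illustration, a hypothetical batch-capable stand-in that simply adds Gaussian noise could look like this — the `std` parameter and the noise scheme are assumptions, not part of the original function:)

```python
import torch

def create_noise(model, image, std=0.1):
    # Hypothetical stand-in: the real create_noise presumably uses `model`.
    # Works on a single image (C, H, W) or a batch (N, C, H, W) alike,
    # since randn_like matches the input shape.
    noisy_image = image + std * torch.randn_like(image)
    return noisy_image.clamp(0.0, 1.0)
```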

What is the best way to create this dataset and dataloader of noisy images?
Things I did:

  1. I tried appending the new data to a list, but the problem with this method is that the list becomes very large and may cause a memory error.

  2. I used a custom dataset approach as follows:

import torch
from torch.utils.data import Dataset
from torchvision import datasets, transforms

class MyDataset(Dataset):
    def __init__(self, model):
        transform = transforms.ToTensor()
        self.dst_train = datasets.CIFAR10('data', train=True, download=True, transform=transform)
        self.model = model

    def __getitem__(self, idx):
        # create_noise expects a batch, so add and then remove a batch dimension
        image = create_noise(self.model, self.dst_train[idx][0].unsqueeze(0))
        image = image.squeeze(0)
        label = self.dst_train[idx][1]
        return image, label

    def __len__(self):
        return len(self.dst_train)

This method works, but the problem is time: although the create_noise function can receive a batch of data, here it has to process the images one by one, which takes a long time.
Can we pass a batch of data to __getitem__? If the answer is yes, how? And in that case, how should the DataLoader for training be constructed?
Finally, I don’t have enough experience with PyTorch and I may not have chosen the right method.
I appreciate any help.

A custom Dataset should certainly work, and depending on the create_noise method you could either add the noise to the data directly, as seen in this post, or sample it in each iteration.
Alternatively, you could write a custom transformation, as seen in this post, which might be a better approach.

However, based on your description I understand that create_noise might be expensive, so you want to avoid calling it for each sample and would instead prefer to call it once per batch.
In this case you could use a BatchSampler and pass the indices for the entire batch to __getitem__, as seen in this post. Note that the batch_size specified in the DataLoader then no longer represents the actual batch size; the number of samples in each batch becomes loader.batch_size * sampler.batch_size. In my example I've defined it only in the sampler and kept the default in the DataLoader.
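A minimal sketch of this pattern (the dataset and its (100, 1) shape are chosen purely for illustration):

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.sampler import BatchSampler, RandomSampler

class BatchIndexDataset(Dataset):
    def __init__(self):
        # Toy data: 100 samples of shape (1,)
        self.data = torch.arange(100).float().view(100, 1)

    def __getitem__(self, idx):
        # idx is a *list* of indices when a BatchSampler is used,
        # so the whole batch is assembled in a single call
        return self.data[idx]

    def __len__(self):
        return len(self.data)

dataset = BatchIndexDataset()
sampler = BatchSampler(RandomSampler(dataset), batch_size=8, drop_last=False)
# batch_size is kept at its default of 1 in the DataLoader
loader = DataLoader(dataset, sampler=sampler)

for batch in loader:
    # the DataLoader's default collation adds its own leading dimension of size 1
    print(batch.shape)  # torch.Size([1, 8, 1]) for full batches
    break
```

Because the DataLoader still collates its (single) "sample" per batch, each batch arrives with an extra leading dimension of size 1, which you can remove with `batch.squeeze(0)` in the training loop.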

Thank you @ptrblck. With your help, I created the following code:

class MyDataset(Dataset):
    def __init__(self, model):
        transform = transforms.ToTensor()
        self.dst_train = datasets.CIFAR10('data', train=True, download=True, transform=transform)
        self.model = model

    def __getitem__(self, idx):
        image = create_noise(self.model, self.dst_train[idx][0])
        image = image.data
        label = self.dst_train[idx][1]
        return image, label

    def __len__(self):
        return len(self.dst_train)

and

net = net.to(device)
noisy_dataset = MyDataset(net)
sampler = torch.utils.data.sampler.BatchSampler(
    torch.utils.data.sampler.RandomSampler(noisy_dataset),
    batch_size=64, drop_last=False)

train_loader = DataLoader(noisy_dataset, sampler=sampler)

But when training the model with this train_loader, I get the following error:

TypeError: list indices must be integers or slices, not list

Can you help me with this?

It seems self.dst_train is not a tensor but behaves like a list.
In the example @ptrblck mentioned, the dataset wraps a tensor of size (100, 1), whereas your self.dst_train does not.
A list cannot be indexed with a list of indices, while a tensor can.
Try converting the data in self.dst_train to a tensor, or index the samples one by one inside __getitem__.
That should work.