Creating a custom dataset from built-in PyTorch datasets along with data transformations

Hi, I am quite new to PyTorch. I was trying to implement transfer learning with CIFAR10 and the built-in resnet18 model. My plan is to first download the original dataset, apply some transformations to it, and then take 500 samples from each of the 10 classes, creating a new dataset with 5,000 training samples in total (instead of the 50,000 samples in the original CIFAR10).
Currently I am just extracting and arranging the indices of the samples by class and then randomly picking 500 of them for each class. But with that method I am not able to get any performance out of the network, even when I apply data augmentation to the original set.
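Roughly, the index-based approach I am describing looks like this (a simplified sketch with illustrative names, not my exact code):

import random
from collections import defaultdict

from torch.utils.data import Subset, DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.ToTensor()])
full_set = datasets.CIFAR10("./data", train=True, download=True, transform=transform)

# group the original sample indices by class label
label2indices = defaultdict(list)
for idx, label in enumerate(full_set.targets):
    label2indices[label].append(idx)

# randomly pick 500 indices per class -> 5,000 samples in total
chosen = []
for label, indices in label2indices.items():
    chosen.extend(random.sample(indices, 500))

small_set = Subset(full_set, chosen)
trainloader = DataLoader(small_set, batch_size=32, shuffle=True)
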
So, is there a way to truly get the 500-per-class data (using some custom dataloader) and then apply data augmentation to it to get better performance?
Thanks in advance!!

Hey, I wrote this dataset for you that gets a subset of CIFAR10 :slight_smile:. Just set the n_images_per_class to 500 and it should be ready to use!

from torchvision import datasets
from collections import defaultdict, deque
import itertools


class Cifar5000(datasets.CIFAR10):
    def __init__(self, path, transforms, train=True):
        super().__init__(path, train, download=True)
        self.transforms = transforms
        self.n_images_per_class = 5  # set this to 500 to keep 500 images per class
        self.n_classes = 10
        self.new2old_indices = self.create_idx_mapping()

    def create_idx_mapping(self):
        # Collect up to n_images_per_class original indices for every label.
        label2idx = defaultdict(lambda: deque(maxlen=self.n_images_per_class))
        for original_idx in range(super().__len__()):
            _, label = super().__getitem__(original_idx)
            label2idx[label].append(original_idx)

        # Flatten the kept indices and map new indices (0..N-1) to the original ones.
        old_idxs = set(itertools.chain(*label2idx.values()))
        new2old_indices = {}
        for new_idx, old_idx in enumerate(old_idxs):
            new2old_indices[new_idx] = old_idx

        return new2old_indices

    def __len__(self):
        return len(self.new2old_indices)

    def __getitem__(self, index):
        # Translate the subset index to the original CIFAR10 index,
        # then apply the user-supplied transforms.
        index = self.new2old_indices[index]
        im, label = super().__getitem__(index)
        return self.transforms(im), label
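
A minimal usage sketch (the transform here is just an example, and it assumes n_images_per_class was set to 500 inside the class):

import torch
from torchvision import transforms

transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),  # example augmentation
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

trainset = Cifar5000("./data", transforms=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True, num_workers=2)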

Thanks @Oli, will definitely check this one out!! :grinning:


@Oli With the method you gave, I created the dataset. But when I checked

len(trainset.data) 

it’s still 50,000. On the other hand, when I checked the trainloader details, it showed the number of data points to be 5,000. Can you please explain what’s happening here?
Thanks!! :slightly_smiling_face:

What is trainset? If you do len(Cifar5000_dataset) it should give 5,000. That’s why we implemented the method

def __len__(self):
    return len(self.new2old_indices)

So after creating the class, I created an object called trainset as

trainset = Cifar500(path="./data", transforms=transform)

Then I am checking the number of samples as

len(trainset.data)

which returns 50,000 (not 5,000).
I created the trainloader with the trainset as

trainloader = torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True, num_workers=2)

And

trainloader.dataset

prints

Dataset Cifar500
    Number of datapoints: 5000
    Root location: ./data
    Split: Train
    Compose(
    ToTensor()
    Normalize(mean=(0.4914, 0.4822, 0.4465), std=(0.2023, 0.1994, 0.201))
)

(5000 datapoints)
I am not able to understand which is actually being fed to the network: 5,000 data points or 50,000 samples.

If you change this to len(trainset) it should work :speedboat:
That’s the one that is actually being used.
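
trainset.data is the raw NumPy array that the parent CIFAR10 class loads, so it always holds all 50,000 images. The DataLoader never reads it directly; it only goes through __len__ and __getitem__, which use the 5,000-entry index mapping. A quick check you could run (assuming the trainset and trainloader from your snippets above):

print(len(trainset.data))  # 50000 - raw array loaded by the parent CIFAR10 class
print(len(trainset))       # 5000  - what the DataLoader actually iterates over
print(len(trainloader))    # 157   - ceil(5000 / 32) batches per epoch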


@Oli I checked. Yes, it is 5,000. Thanks for your precious help :pray::slightly_smiling_face:
By the way, with this implementation, does the trainloader pick 500 random samples during each epoch, or is it sampled only once at the start?

@Ajinkya_Ambatwar You’re welcome :slight_smile:

It is sampled only once at the start, but it can be different each time you run your Python file. If you want it to be the same every time you run your script, you can use this seed function at the very beginning of your program:

import random
import numpy as np
import torch


def seed_program(seed=0):
    '''Seed for reproducibility'''
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # torch.backends.cudnn.deterministic = True  # You can add this

@Oli Yes, I am aware of that. Thanks for all the clarification!!
Also, I had another unrelated doubt: how can we add dropout to built-in PyTorch models like, say, resnet18? I know that if we are building the network from scratch we can add dropout layers using nn.Dropout, but how can we do that with built-in models?

I don’t know a way other than building a custom network. But you don’t have to build it from scratch; the resnet18 code is on GitHub, so you can copy that and change the source code :sailboat:
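
One lighter-weight alternative (not mentioned in the reply above) is to replace only the classifier head, since torchvision's resnet18 exposes its final layer as model.fc. A rough sketch; the dropout probability and class count are just examples:

import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)

# Swap the final fully connected layer for dropout + a fresh classifier.
in_features = model.fc.in_features
model.fc = nn.Sequential(
    nn.Dropout(p=0.5),           # dropout probability is just an example
    nn.Linear(in_features, 10),  # 10 output classes for CIFAR10
)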

Oh thanks… That should help!! :grinning: