Creating a custom dataset from built-in PyTorch datasets along with data transformations

Hi, I am quite new to PyTorch. I was trying to implement transfer learning with CIFAR10 and the built-in resnet18 model. My plan is to first download the original dataset, apply some transformations to it, and then take 500 samples from each of the 10 classes, creating a new dataset with 5,000 training samples in total (instead of the 50,000 samples in the original CIFAR10).
Currently I am just extracting and arranging the indices of the samples by class and then randomly picking 500 of them for each class. But with that method I am not able to get any performance out of the network, even when I apply data augmentation to the original set.
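Roughly, the index-based approach I am describing looks like this (a simplified sketch with illustrative names, not my exact code):

import random
from collections import defaultdict

from torch.utils.data import Subset, DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.ToTensor()])
full_set = datasets.CIFAR10("./data", train=True, download=True, transform=transform)

# group the original sample indices by class label
label2indices = defaultdict(list)
for idx, label in enumerate(full_set.targets):
    label2indices[label].append(idx)

# randomly pick 500 indices per class -> 5,000 samples in total
chosen = []
for label, indices in label2indices.items():
    chosen.extend(random.sample(indices, 500))

small_set = Subset(full_set, chosen)
trainloader = DataLoader(small_set, batch_size=32, shuffle=True)
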
So, is there a way to truly get the 500-per-class data (using some custom dataloader) and then apply data augmentation to it to get better performance?
Thanks in advance!!

Hey, I wrote this dataset for you that gets a subset of CIFAR10 :slight_smile:. Just set the n_images_per_class to 500 and it should be ready to use!

from torchvision import datasets
from collections import defaultdict, deque
import itertools


class Cifar5000(datasets.CIFAR10):
    def __init__(self, path, transforms, train=True):
        super().__init__(path, train, download=True)
        self.transforms = transforms
        self.n_images_per_class = 5  # set this to 500 to keep 500 images per class
        self.n_classes = 10
        self.new2old_indices = self.create_idx_mapping()

    def create_idx_mapping(self):
        # Collect up to n_images_per_class original indices for every label.
        label2idx = defaultdict(lambda: deque(maxlen=self.n_images_per_class))
        for original_idx in range(super().__len__()):
            _, label = super().__getitem__(original_idx)
            label2idx[label].append(original_idx)

        # Flatten the kept indices and map new indices (0..N-1) to the original ones.
        old_idxs = set(itertools.chain(*label2idx.values()))
        new2old_indices = {}
        for new_idx, old_idx in enumerate(old_idxs):
            new2old_indices[new_idx] = old_idx

        return new2old_indices

    def __len__(self):
        return len(self.new2old_indices)

    def __getitem__(self, index):
        # Translate the subset index to the original CIFAR10 index,
        # then apply the user-supplied transforms.
        index = self.new2old_indices[index]
        im, label = super().__getitem__(index)
        return self.transforms(im), label
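
A minimal usage sketch (the transform here is just an example, and it assumes n_images_per_class was set to 500 inside the class):

import torch
from torchvision import transforms

transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),  # example augmentation
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

trainset = Cifar5000("./data", transforms=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True, num_workers=2)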

Thanks @Oli, will definitely check this one out!! :grinning:


@Oli With the method you gave, I created the dataset. But when I checked

len(trainset.data) 

it’s still 50,000. On the other hand, when I checked the trainloader details, it showed the number of data points to be 5,000. Can you please explain what’s happening here?
Thanks!! :slightly_smiling_face:

What is trainset? If you do len(Cifar5000_dataset) it should give 5,000. That’s why we implemented the method

def __len__(self):
    return len(self.new2old_indices)

So after creating the class, I created an object called trainset as

trainset = Cifar500(path="./data", transforms=transform)

Then I am checking the number of samples as

len(trainset.data)

which returns 50,000 (not 5,000).
I created the trainloader with the trainset as

trainloader = torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True, num_workers=2)

And

trainloader.dataset

prints

Dataset Cifar500
    Number of datapoints: 5000
    Root location: ./data
    Split: Train
    Compose(
    ToTensor()
    Normalize(mean=(0.4914, 0.4822, 0.4465), std=(0.2023, 0.1994, 0.201))
)

(5000 datapoints)
I am not able to understand which is actually being fed to the network: 5,000 data points or 50,000 samples.

If you change this to len(trainset) it should work :speedboat:
That’s the one that is actually being used.
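
trainset.data is the raw NumPy array that the parent CIFAR10 class loads, so it always holds all 50,000 images. The DataLoader never reads it directly; it only goes through __len__ and __getitem__, which use the 5,000-entry index mapping. A quick check you could run (assuming the trainset and trainloader from your snippets above):

print(len(trainset.data))  # 50000 - raw array loaded by the parent CIFAR10 class
print(len(trainset))       # 5000  - what the DataLoader actually iterates over
print(len(trainloader))    # 157   - ceil(5000 / 32) batches per epoch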


@Oli I checked. Yes, it is 5,000. Thanks for your precious help :pray::slightly_smiling_face:
By the way, with this implementation, does the trainloader pick 500 random samples during each epoch, or is it sampled only once at the start?

@Ajinkya_Ambatwar You’re welcome :slight_smile:

It is sampled only once at the start, but it can be different each time you run your Python file. If you want it to be the same every time you run your script, you can use this seed function at the very beginning of your program:

import random
import numpy as np
import torch


def seed_program(seed=0):
    '''Seed for reproducibility'''
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # torch.backends.cudnn.deterministic = True  # You can add this

@Oli Yes, I am aware of that. Thanks for all the clarification!!
Also, I had another unrelated doubt: how can we add dropout to built-in PyTorch models like, say, resnet18? I know that if we are building the network from scratch we can add dropout layers using nn.Dropout, but how can we do that with built-in models?

I don’t know a way other than building a custom network. But you don’t have to build it from scratch; the resnet18 code is on GitHub, so you can copy that and change the source code :sailboat:
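
One lighter-weight alternative (not mentioned in the reply above) is to replace only the classifier head, since torchvision's resnet18 exposes its final layer as model.fc. A rough sketch; the dropout probability and class count are just examples:

import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)

# Swap the final fully connected layer for dropout + a fresh classifier.
in_features = model.fc.in_features
model.fc = nn.Sequential(
    nn.Dropout(p=0.5),           # dropout probability is just an example
    nn.Linear(in_features, 10),  # 10 output classes for CIFAR10
)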

Oh thanks… That should help!! :grinning: