Making a custom dataset

Assume I create two datasets that differ in their `__getitem__` protocol (for example, “dataset1” in the code below returns a denoised version of every image in the original dataset, while “dataset2” returns the original version of the image), and I want to create a new dataset consisting of the first 10 images from the first dataset and the first 10 images from the second dataset. Running the code below indeed gives a new dataset, but it does not preserve the attributes of the original datasets (such as classes, samples, etc.) that are required for the loading part.

How can I solve this problem?

dataset1 = DatasetFolder1(parameters)
dataset2 = DatasetFolder2(parameters)
dataset = [dataset1[i] for i in range(20)]
for i in range(10, 20):
    dataset[i] = dataset2[i]

If both Datasets store the same data in the noisy and original format, you might need to use the same indices.
Currently you are storing the first 20 samples in `dataset`, then overwriting indices `[10:20]` with the “second batch” of dataset2.
Probably this would work:

dataset = [dataset1[i] for i in range(20)]
for i in range(10, 20):
    dataset[i] = dataset2[i - 10]

Although this doesn’t answer my question, you are completely right… my code overwrote the last 10 samples of “dataset” with the second batch of dataset2, as opposed to what I was trying to do. Thanks for that.

My problem is that when I use the classes “DatasetFolder1” and “DatasetFolder2” that are derived from “torchvision.datasets.DatasetFolder”, the objects “dataset1” and “dataset2” contain several variables (such as “classes”, “extensions”, “samples”) that are omitted in the new object “dataset” when I use

dataset = [dataset1[i] for i in range(20) ]

As far as I can see, the problem is that the object “dataset” is no longer of type “DatasetFolder” (it is just a list). Is there any way of creating a new “DatasetFolder” instance with some samples coming from dataset1 and some from dataset2?

Hope I made myself clear.
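For reference, `torch.utils.data` also provides `Subset` and `ConcatDataset`, which keep each underlying `DatasetFolder` (and its `classes`, `samples`, etc.) intact while exposing a combined view, e.g. `ConcatDataset([Subset(dataset1, range(10)), Subset(dataset2, range(10))])`. The sketch below reproduces that indexing logic in plain Python so it runs without torchvision; the class names and toy data are illustrative:

```python
class SimpleSubset:
    # Mirrors torch.utils.data.Subset: a view onto chosen indices of a dataset
    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = list(indices)

    def __getitem__(self, i):
        return self.dataset[self.indices[i]]

    def __len__(self):
        return len(self.indices)


class SimpleConcat:
    # Mirrors torch.utils.data.ConcatDataset: chains datasets end to end
    def __init__(self, datasets):
        self.datasets = list(datasets)

    def __getitem__(self, i):
        for d in self.datasets:
            if i < len(d):
                return d[i]
            i -= len(d)
        raise IndexError(i)

    def __len__(self):
        return sum(len(d) for d in self.datasets)


# Toy stand-ins for the denoised and original datasets
denoised = [("denoised", i) for i in range(50)]
original = [("original", i) for i in range(50)]

# First 10 items of each, concatenated: indices 0-9 come from
# the denoised data, indices 10-19 from the original data
combined = SimpleConcat([SimpleSubset(denoised, range(10)),
                         SimpleSubset(original, range(10))])
```

Because both wrappers only store references and indices, each underlying dataset keeps all of its own attributes.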


I think you could create both Datasets and pass them to a custom Dataset, which concatenates the samples of both underlying Datasets.
I’ve created a small example using ImageFolder:

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms

# placeholder paths and transform - substitute your own
dataset1 = datasets.ImageFolder(root='path/to/denoised', transform=transforms.ToTensor())
dataset2 = datasets.ImageFolder(root='path/to/original', transform=transforms.ToTensor())

class MyDataset(Dataset):
    def __init__(self, dataset1, dataset2):
        self.dataset1 = dataset1
        self.dataset2 = dataset2
    def __getitem__(self, index):
        x1, y1 = self.dataset1[index]
        x2, y2 = self.dataset2[index]
        x = torch.stack((x1, x2))
        y = torch.stack((torch.tensor(y1), torch.tensor(y2)))
        return x, y
    def __len__(self):
        return len(self.dataset1)

dataset = MyDataset(dataset1, dataset2)

# placeholder batch size
loader = DataLoader(dataset, batch_size=10, shuffle=True)

for data, target in loader:
    data = data.view(-1, *data.size()[2:])
    target = target.view(-1)

As you can see, you will double the actual batch size.
Would that work for you?
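The batch-size doubling at the end can be illustrated without torch: each item of the paired dataset yields a stacked pair, so a batch of N items flattens to 2N individual samples. The class below is a plain-Python stand-in for the tensor version above, with toy lists in place of the two `ImageFolder` datasets:

```python
class PairedDataset:
    # Stand-in for MyDataset: pairs the item at the same index
    # from two equal-length datasets
    def __init__(self, dataset1, dataset2):
        self.dataset1 = dataset1
        self.dataset2 = dataset2

    def __getitem__(self, index):
        x1, y1 = self.dataset1[index]
        x2, y2 = self.dataset2[index]
        return (x1, x2), (y1, y2)

    def __len__(self):
        return len(self.dataset1)


# Toy (sample, label) data standing in for the two ImageFolders
noisy = [(f"noisy_{i}", i % 2) for i in range(8)]
clean = [(f"clean_{i}", i % 2) for i in range(8)]
paired = PairedDataset(noisy, clean)

# A "batch" of 4 paired items...
batch = [paired[i] for i in range(4)]
# ...flattens to 8 individual samples - this is what the
# data.view(-1, ...) / target.view(-1) calls do for tensors
samples = [x for (x1, x2), _ in batch for x in (x1, x2)]
```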


Your idea worked. Thanks a lot!