Creating a DataLoader for unsupervised learning (MNIST, SVHN)

I need to solve an unsupervised problem with images from MNIST and SVHN: I have 100 images from MNIST and 10 images from SVHN. I need a pre-trained net to learn to classify whether a given image is from MNIST or from SVHN (the anomaly). Basically, it’s an anomaly detection problem.
I know I’ll have to tackle that later by integrating a clustering technique/SVM/whatever (or an autoencoder) with the net somehow, plus a proper loss, but first things first :slight_smile:

It’s very easy to create a regular DataLoader for MNIST and SVHN in PyTorch, but in this case I need to create a DataLoader that includes a small subset of the two datasets without the labels (that’s the more interesting part here :) ). How can I do that?

I think the easiest approach would be to write a custom Dataset, and load the desired samples inside of __init__.
Since the data format is different for MNIST and CIFAR (size and number of channels), you would also need to specify some dataset-specific transformations.
Here is some small sample code, which could be a good starter:

import torch
from torch.utils.data import Dataset
import torchvision.datasets as datasets
import torchvision.transforms as transforms


class MyDataset(Dataset):
    def __init__(self, mnist_transform=None, cifar_transform=None):
        mnist = datasets.MNIST(
            root='./data',
        )
        cifar = datasets.CIFAR10(
            root='./data',
        )
        
        self.mnist_len = 100
        self.cifar_len = 10
        
        rand_idx = torch.randperm(len(mnist.data))[:self.mnist_len]
        self.mnist_data = mnist.data[rand_idx]
        
        rand_idx = torch.randperm(len(cifar.data))[:self.cifar_len]
        self.cifar_data = cifar.data[rand_idx]

        self.mnist_transform = mnist_transform
        self.cifar_transform = cifar_transform
        
    def __getitem__(self, index):
        if index < self.mnist_len:
            x = self.mnist_data[index]
            if self.mnist_transform:
                x = self.mnist_transform(x)
            print('Returning MNIST sample at index {}'.format(index))
            return x
        else:
            index = index - self.mnist_len
            x = self.cifar_data[index]
            if self.cifar_transform:
                x = self.cifar_transform(x)
            print('Returning CIFAR data at index {}'.format(index))
            return x
        
    def __len__(self):
        return self.mnist_len + self.cifar_len

        

mnist_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((32, 32)),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
    
cifar_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
    
dataset = MyDataset(
    mnist_transform=mnist_transform,
    cifar_transform=cifar_transform
)
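
As a quick sanity check (just a rough sketch, assuming a recent torchvision where ToPILImage accepts the 2D MNIST tensors), you could inspect a few samples:

print(len(dataset))     # 110
x = dataset[0]          # an MNIST sample
print(x.shape)          # torch.Size([3, 32, 32])
x = dataset[105]        # a CIFAR sample
print(x.shape)          # torch.Size([3, 32, 32])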

Looks pretty good as a starter.

If I download the .gz MNIST test-images file and the SVHN test images, extract them, and put them in the data/mnist and data/svhn/test folders respectively, it should create MyDataset as the new dataset with 100 images from MNIST and the next 10 images from SVHN, right?

I switched CIFAR with SVHN, and resized the SVHN images to 32 as well (I’ll probably use a net that was trained on CIFAR). Are any other changes required? (code below)

Currently it appears that it can’t find the dataset:
RuntimeError: Dataset not found. You can use download=True to download it

import torch

import torchvision.transforms as transforms
import torchvision.datasets as datasets
from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    def __init__(self, mnist_transform=None, svhn_transform=None):
        mnist = datasets.MNIST(
            root='./data/mnist',
        )
        svhn = datasets.SVHN(
            root='./data/svhn/test',
        )

        self.mnist_len = 100
        self.svhn_len = 10

        rand_idx = torch.randperm(len(mnist.data))[:self.mnist_len]
        self.mnist_data = mnist.data[rand_idx]

        rand_idx = torch.randperm(len(svhn.data))[:self.svhn_len]
        self.svhn_data = svhn.data[rand_idx]

        self.mnist_transform = mnist_transform
        self.svhn_transform = svhn_transform

    def __getitem__(self, index):
        if index < self.mnist_len:
            x = self.mnist_data[index]
            if self.mnist_transform:
                x = self.mnist_transform(x)
            print('Returning MNIST sample at index {}'.format(index))
            return x
        else:
            index = index - self.mnist_len
            x = self.svhn_data[index]
            if self.svhn_transform:
                x = self.svhn_transform(x)
            print('Returning SVHN data at index {}'.format(index))
            return x

    def __len__(self):
        return self.mnist_len + self.svhn_len


mnist_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((32, 32)),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

svhn_transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToPILImage(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

dataset = MyDataset(
    mnist_transform=mnist_transform,
    svhn_transform=svhn_transform
)

I wouldn’t recommend downloading the datasets manually. Instead, just pass download=True as the error message says, and the files will be downloaded and extracted automatically.

MyDataset will use 100 random MNIST images and 10 random SVHN images.
The indices are defined in rand_idx. You can of course change these indices to specific ones.
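For example, a minimal sketch of a fixed (non-random) selection inside __init__ could look like this:

# e.g. take the first samples instead of random ones
self.mnist_data = mnist.data[:self.mnist_len]
self.svhn_data = svhn.data[:self.svhn_len]

or you could pass a tensor of hand-picked indices instead of rand_idx.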


Yes, it’s better to do it the regular way with download = True.

It looks like it doesn’t recognize MNIST.data

rand_idx = torch.randperm(len(mnist.data))[:self.mnist_len]

AttributeError: 'MNIST' object has no attribute 'data'

If I comment out that line (and the next one) it works well. It’s a bit odd that it says it doesn’t have that attribute.

In older torchvision versions, self.train_data and self.test_data were used, so you might want to update torchvision or use the corresponding attribute.
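
If you want to stay on the older version, a small (untested) sketch of a version-agnostic lookup inside __init__ could be:

# fall back to the old attribute names if .data is missing
try:
    mnist_images = mnist.data
except AttributeError:
    mnist_images = mnist.test_data  # or mnist.train_data, if train=True was used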


Good to know. I prefer to stay with my current package versions for the time being.
mnist.test_data doesn’t work, if that’s what you meant.

Depending on whether you specified train=True or train=False when instantiating your dataset, only the corresponding attribute will work.


True, I forgot to specify that I want the test.
Thank you :slight_smile:


I have a trained network plus a feature extractor that returns the output of the last conv layer, but when I try to simply pass the dataset through the model by
output = model(dataset)

I get this error:
TypeError: conv2d(): argument ‘input’ (position 1) must be Tensor, not MyDataset

dataset is a MyDataset instance, but I guess the transform should allow me to pass the data through the net. What am I missing here?

You cannot pass the Dataset instance directly to your model, but would have to pass data batches to it.
The recommended way to do this is to wrap your dataset in a DataLoader and iterate over it:

dataset = MyDataset(...)
loader = DataLoader(
    dataset,
    batch_size=1,
    shuffle=True,
    num_workers=2
)

for data in loader:
    output = model(data)
    ...

Note that I just set some arguments while creating the DataLoader, so you might want to change e.g. the batch size etc.
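
Also, since you are only extracting features with a pretrained model, you would most likely want to switch the model to eval mode and disable gradient computation (a general sketch, not specific to your model):

model.eval()              # use running batchnorm stats, disable dropout
with torch.no_grad():     # no gradients needed for pure inference
    for data in loader:
        output = model(data)
        ...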


Thanks, it’s clearer now. But it seems like __getitem__ is not defined properly; it fails to access self.MNIST, as MyDataset has no MNIST attribute.

if index < self.MNIST:

AttributeError: Traceback (most recent call last):
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "", line 27, in __getitem__
if index < self.MNIST:
AttributeError: 'MyDataset' object has no attribute 'MNIST'

I tried to replace self.MNIST with self.mnist_len, but it didn’t help; now I get that the MNIST object is not callable:
TypeError: Traceback (most recent call last):
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "", line 30, in __getitem__
x = self.mnist_transform(x)
TypeError: 'MNIST' object is not callable

My code should work if you just change the mnist.data attribute to mnist.test_data, assuming you are using the test set with an older torchvision version.
What else did you change? Could you compare your code to mine?

Exactly the same code:

class MyDataset(Dataset):
    def __init__(self, mnist_transform=None, svhn_transform=None):
        mnist = datasets.MNIST(
            root='./data/mnist',
            train=False,
            download= True
        )
                
        svhn = datasets.SVHN(
            root='./data/svhn',
            download=True
        )
        
        self.mnist_len = 100
        self.svhn_len = 10
        
        rand_idx = torch.randperm(len(mnist.test_data))[:self.mnist_len]
        self.mnist_data = mnist.test_data[rand_idx]
        
        rand_idx = torch.randperm(len(svhn.data))[:self.svhn_len]
        self.svhn_data = svhn.data[rand_idx]

        self.mnist_transform = mnist
        self.svhn_transform = svhn_transform
        
    def __getitem__(self, index):
        if index < self.mnist_len:
            x = self.mnist_data[index]
            if self.mnist_transform:
                x = self.mnist_transform(x)
            print('Returning MNIST sample at index {}'.format(index))
            return x
        else:
            index = index - self.mnist_len
            x = self.svhn_data[index]
            if self.svhn_transform:
                x = self.svhn_transform(x)
            print('Returning SVHN data at index {}'.format(index))
            return x
        
    def __len__(self):
        return self.mnist_len + self.svhn_len

        

mnist_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((32, 32)),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
    
svhn_transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToPILImage(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
    
dataset_new = MyDataset(
    mnist_transform=mnist_transform,
    svhn_transform=svhn_transform
)


loader = DataLoader(
    dataset_new,
    batch_size=100,
    shuffle=True,
    num_workers=4
)

for data in loader:
    output = model2(data)
    print(output)

TypeError: 'MNIST' object is not callable

Apparently, these lines were changed:

if index < self.MNIST:
# should be
if index < self.mnist_len:

self.mnist_transform = mnist
# should be
self.mnist_transform = mnist_transform

Yeah, I missed that. I probably changed it accidentally.
But unfortunately now it says that pic should be a Tensor or ndarray, even though what it got is a torch.Tensor. Sounds like something is wrong with the transform, but it looks OK. The last two lines in the trace are quite odd; they look like they contradict each other (probably a mistake, I guess):

raise TypeError('pic should be Tensor or ndarray. Got {}.'.format(type(pic)))

TypeError: pic should be Tensor or ndarray. Got <class 'torch.Tensor'>.

The full trace:
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 307, in _process_next_batch
raise batch.exc_type(batch.exc_msg)

TypeError: Traceback (most recent call last):
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "", line 30, in __getitem__
x = self.mnist_transform(x)
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torchvision/transforms/transforms.py", line 49, in __call__
img = t(img)
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torchvision/transforms/transforms.py", line 110, in __call__
return F.to_pil_image(pic, self.mode)
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torchvision/transforms/functional.py", line 103, in to_pil_image
raise TypeError('pic should be Tensor or ndarray. Got {}.'.format(type(pic)))
TypeError: pic should be Tensor or ndarray. Got <class 'torch.Tensor'>.

Could you try to unsqueeze the data sample by using:

x = self.mnist_data[index]
if self.mnist_transform:
    x = self.mnist_transform(x.unsqueeze(0))

I assume this error is thrown if the channel dimension in your image tensor is missing (each MNIST sample should be a [28, 28] shaped tensor), which is fixed in the latest torchvision version.

That makes sense as a possible cause, though transforms.Grayscale(num_output_channels=3) should deal with the channels.
This change results in a quite similar error:

TypeError: Traceback (most recent call last):
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "", line 37, in __getitem__
x = self.svhn_transform(x)
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torchvision/transforms/transforms.py", line 49, in __call__
img = t(img)
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torchvision/transforms/transforms.py", line 175, in __call__
return F.resize(img, self.size, self.interpolation)
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torchvision/transforms/functional.py", line 189, in resize
raise TypeError('img should be PIL Image. Got {}'.format(type(img)))
TypeError: img should be PIL Image. Got <class 'numpy.ndarray'>

Grayscale(num_output_channels=3) is applied only after ToPILImage(), and ToPILImage() is the transform that was throwing this error, so it cannot help here.
The new issue comes from svhn_transform: apparently you are using Resize before ToPILImage. Could you swap the order and run it again?
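
I.e. something along these lines (just the order swapped, everything else unchanged):

svhn_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])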


It works now! Good to know that ToPILImage must precede Resize.

Finally the DataLoader works fine and I can pass the data through the model. Thanks a lot for your patience! :slight_smile:
