Creating a DataLoader for unsupervised learning (MNIST, SVHN)

I need to solve an unsupervised problem with images from MNIST and SVHN: I have 100 images from MNIST and 10 images from SVHN. I need a pre-trained net to learn to classify whether a given image is from MNIST or from SVHN (the anomaly). Basically, it’s an anomaly detection problem.
I know I’ll have to tackle that later by integrating a clustering technique/SVM/whatever (or an autoencoder) with the net somehow, plus a proper loss, but first things first :slight_smile:

It’s very easy to create a regular DataLoader for MNIST and SVHN in PyTorch, but in this case I need to create a DataLoader that includes a small subset of the two datasets without the labels (that’s the more interesting part here :) ). How can I do that?

I think the easiest approach would be to write a custom Dataset, and load the desired samples inside of __init__.
Since the data format is different for MNIST and CIFAR (size and number of channels), you would also need to specify some dataset-specific transformations.
Here is some small sample code, which could be a good starter:

import torch
from torch.utils.data import Dataset
import torchvision.datasets as datasets
import torchvision.transforms as transforms


class MyDataset(Dataset):
    def __init__(self, mnist_transform=None, cifar_transform=None):
        mnist = datasets.MNIST(
            root='./data',
        )
        cifar = datasets.CIFAR10(
            root='./data',
        )
        
        self.mnist_len = 100
        self.cifar_len = 10
        
        rand_idx = torch.randperm(len(mnist.data))[:self.mnist_len]
        self.mnist_data = mnist.data[rand_idx]
        
        rand_idx = torch.randperm(len(cifar.data))[:self.cifar_len]
        self.cifar_data = cifar.data[rand_idx]

        self.mnist_transform = mnist_transform
        self.cifar_transform = cifar_transform
        
    def __getitem__(self, index):
        if index < self.mnist_len:
            x = self.mnist_data[index]
            if self.mnist_transform:
                x = self.mnist_transform(x)
            print('Returning MNIST sample at index {}'.format(index))
            return x
        else:
            index = index - self.mnist_len
            x = self.cifar_data[index]
            if self.cifar_transform:
                x = self.cifar_transform(x)
            print('Returning CIFAR data at index {}'.format(index))
            return x
        
    def __len__(self):
        return self.mnist_len + self.cifar_len

        

mnist_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((32, 32)),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
    
cifar_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
    
dataset = MyDataset(
    mnist_transform=mnist_transform,
    cifar_transform=cifar_transform
)
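
As a quick sanity check (just a rough sketch, assuming a recent torchvision where ToPILImage accepts the 2D MNIST tensors), you could inspect a few samples:

print(len(dataset))     # 110
x = dataset[0]          # an MNIST sample
print(x.shape)          # torch.Size([3, 32, 32])
x = dataset[105]        # a CIFAR sample
print(x.shape)          # torch.Size([3, 32, 32])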

Looks pretty good as a starter.

If I download the .gz MNIST test-images file and the SVHN test images, extract them, and put them in the data/mnist and data/svhn/test folders respectively, it should create MyDataset as the new dataset with 100 images from MNIST and the next 10 images from SVHN, right?

I switched CIFAR with SVHN, and resized the SVHN images to 32 as well (I’ll probably use a net that was trained on CIFAR). Are any other changes required? (code below)

Currently it appears that it can’t find the dataset:
RuntimeError: Dataset not found. You can use download=True to download it

import torch

import torchvision.transforms as transforms
import torchvision.datasets as datasets
from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    def __init__(self, mnist_transform=None, svhn_transform=None):
        mnist = datasets.MNIST(
            root='./data/mnist',
        )
        svhn = datasets.SVHN(
            root='./data/svhn/test',
        )

        self.mnist_len = 100
        self.svhn_len = 10

        rand_idx = torch.randperm(len(mnist.data))[:self.mnist_len]
        self.mnist_data = mnist.data[rand_idx]

        rand_idx = torch.randperm(len(svhn.data))[:self.svhn_len]
        self.svhn_data = svhn.data[rand_idx]

        self.mnist_transform = mnist_transform
        self.svhn_transform = svhn_transform

    def __getitem__(self, index):
        if index < self.mnist_len:
            x = self.mnist_data[index]
            if self.mnist_transform:
                x = self.mnist_transform(x)
            print('Returning MNIST sample at index {}'.format(index))
            return x
        else:
            index = index - self.mnist_len
            x = self.svhn_data[index]
            if self.svhn_transform:
                x = self.svhn_transform(x)
            print('Returning SVHN data at index {}'.format(index))
            return x

    def __len__(self):
        return self.mnist_len + self.svhn_len


mnist_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((32, 32)),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

svhn_transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToPILImage(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

dataset = MyDataset(
    mnist_transform=mnist_transform,
    svhn_transform=svhn_transform
)

I wouldn’t recommend downloading the datasets manually. Instead, just pass download=True as the error message says, and the files will be downloaded and extracted automatically.

MyDataset will use 100 random MNIST images and 10 random SVHN images.
The indices are defined in rand_idx. You can of course change these indices to specific ones.
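For example, a minimal sketch of a fixed (non-random) selection inside __init__ could look like this:

# e.g. take the first samples instead of random ones
self.mnist_data = mnist.data[:self.mnist_len]
self.svhn_data = svhn.data[:self.svhn_len]

or you could pass a tensor of hand-picked indices instead of rand_idx.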


Yes, it’s better to do it the regular way with download = True.

It looks like it doesn’t recognize MNIST.data

rand_idx = torch.randperm(len(mnist.data))[:self.mnist_len]

AttributeError: 'MNIST' object has no attribute 'data'

If I comment out that line (and the next one) it works well. It’s a bit odd that it says it doesn’t have that attribute.

In older torchvision versions, self.train_data and self.test_data were used, so you might want to update torchvision or use the corresponding attribute.
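
If you want to stay on the older version, a small (untested) sketch of a version-agnostic lookup inside __init__ could be:

# fall back to the old attribute names if .data is missing
try:
    mnist_images = mnist.data
except AttributeError:
    mnist_images = mnist.test_data  # or mnist.train_data, if train=True was used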


Good to know. I prefer to stay with my current package versions for the time being.
mnist.test_data doesn’t work, if that’s what you meant.

Depending on whether you specified train=True or train=False when instantiating your dataset, only the corresponding attribute will work.


True, I forgot to specify that I want the test.
Thank you :slight_smile:


I have a trained network plus a feature extractor that returns the output of the last conv layer, but when I try to simply pass the dataset through the model by
output = model(dataset)

I get this error:
TypeError: conv2d(): argument ‘input’ (position 1) must be Tensor, not MyDataset

dataset is a MyDataset instance, but I guess the transform should allow me to pass the data through the net. What am I missing here?

You cannot pass the Dataset instance directly to your model, but would have to pass data batches to it.
The recommended way to do this is to wrap your dataset in a DataLoader and iterate over it:

dataset = MyDataset(...)
loader = DataLoader(
    dataset,
    batch_size=1,
    shuffle=True,
    num_workers=2
)

for data in loader:
    output = model(data)
    ...

Note that I just set some arguments while creating the DataLoader, so you might want to change e.g. the batch size etc.
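
Also, since you are only extracting features with a pretrained model, you would most likely want to switch the model to eval mode and disable gradient computation (a general sketch, not specific to your model):

model.eval()              # use running batchnorm stats, disable dropout
with torch.no_grad():     # no gradients needed for pure inference
    for data in loader:
        output = model(data)
        ...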


Thanks, it’s clearer now. But it seems like __getitem__ is not defined properly; it fails to access self.MNIST, as MyDataset has no MNIST attribute.

if index < self.MNIST:

AttributeError: Traceback (most recent call last):
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "", line 27, in __getitem__
if index < self.MNIST:
AttributeError: 'MyDataset' object has no attribute 'MNIST'

I tried to replace self.MNIST with self.mnist_len, but it didn’t help; now I get that the MNIST object is not callable:
TypeError: Traceback (most recent call last):
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "", line 30, in __getitem__
x = self.mnist_transform(x)
TypeError: 'MNIST' object is not callable

My code should work if you just change the mnist.data attribute to mnist.test_data, assuming you are using the test set with an older torchvision version.
What else did you change? Could you compare your code to mine?

Exactly the same code:

class MyDataset(Dataset):
    def __init__(self, mnist_transform=None, svhn_transform=None):
        mnist = datasets.MNIST(
            root='./data/mnist',
            train=False,
            download= True
        )
                
        svhn = datasets.SVHN(
            root='./data/svhn',
            download=True
        )
        
        self.mnist_len = 100
        self.svhn_len = 10
        
        rand_idx = torch.randperm(len(mnist.test_data))[:self.mnist_len]
        self.mnist_data = mnist.test_data[rand_idx]
        
        rand_idx = torch.randperm(len(svhn.data))[:self.svhn_len]
        self.svhn_data = svhn.data[rand_idx]

        self.mnist_transform = mnist
        self.svhn_transform = svhn_transform
        
    def __getitem__(self, index):
        if index < self.mnist_len:
            x = self.mnist_data[index]
            if self.mnist_transform:
                x = self.mnist_transform(x)
            print('Returning MNIST sample at index {}'.format(index))
            return x
        else:
            index = index - self.mnist_len
            x = self.svhn_data[index]
            if self.svhn_transform:
                x = self.svhn_transform(x)
            print('Returning SVHN data at index {}'.format(index))
            return x
        
    def __len__(self):
        return self.mnist_len + self.svhn_len

        

mnist_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((32, 32)),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
    
svhn_transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToPILImage(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
    
dataset_new = MyDataset(
    mnist_transform=mnist_transform,
    svhn_transform=svhn_transform
)


loader = DataLoader(
    dataset_new,
    batch_size=100,
    shuffle=True,
    num_workers=4
)

for data in loader:
    output = model2(data)
    print(output)

TypeError: 'MNIST' object is not callable

Apparently, these lines were changed:

if index < self.MNIST:
# should be
if index < self.mnist_len:

self.mnist_transform = mnist
# should be
self.mnist_transform = mnist_transform

Yeah, I missed that. I probably changed it accidentally.
But unfortunately now it says that pic should be a Tensor or ndarray, even though what it got is a torch.Tensor. Sounds like something is wrong with the transform, but it looks OK. The last two lines in the trace are quite odd; they look like they contradict each other (probably a mistake, I guess):

raise TypeError('pic should be Tensor or ndarray. Got {}.'.format(type(pic)))

TypeError: pic should be Tensor or ndarray. Got <class 'torch.Tensor'>.

The full trace:
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 307, in _process_next_batch
raise batch.exc_type(batch.exc_msg)

TypeError: Traceback (most recent call last):
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "", line 30, in __getitem__
x = self.mnist_transform(x)
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torchvision/transforms/transforms.py", line 49, in __call__
img = t(img)
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torchvision/transforms/transforms.py", line 110, in __call__
return F.to_pil_image(pic, self.mode)
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torchvision/transforms/functional.py", line 103, in to_pil_image
raise TypeError('pic should be Tensor or ndarray. Got {}.'.format(type(pic)))
TypeError: pic should be Tensor or ndarray. Got <class 'torch.Tensor'>.

Could you try to unsqueeze the data sample by using:

x = self.mnist_data[index]
if self.mnist_transform:
    x = self.mnist_transform(x.unsqueeze(0))

I assume this error is thrown if the channel dimension in your image tensor is missing (each MNIST sample should be a [28, 28] shaped tensor), which is fixed in the latest torchvision version.

That makes sense as a possible cause, though transforms.Grayscale(num_output_channels=3) should deal with the channels.
This change results in a quite similar error:

TypeError: Traceback (most recent call last):
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "", line 37, in __getitem__
x = self.svhn_transform(x)
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torchvision/transforms/transforms.py", line 49, in __call__
img = t(img)
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torchvision/transforms/transforms.py", line 175, in __call__
return F.resize(img, self.size, self.interpolation)
File "/home/nimrod/anaconda3/lib/python3.6/site-packages/torchvision/transforms/functional.py", line 189, in resize
raise TypeError('img should be PIL Image. Got {}'.format(type(img)))
TypeError: img should be PIL Image. Got <class 'numpy.ndarray'>

Grayscale(num_output_channels=3) is applied only after ToPILImage(), and ToPILImage() is the transform that was throwing this error, so it cannot help here.
The new issue comes from svhn_transform: apparently you are using Resize before ToPILImage. Could you swap the order and run it again?
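
I.e. something along these lines (just the order swapped, everything else unchanged):

svhn_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])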


It works now! Good to know that ToPILImage must precede Resize.

Finally the DataLoader works fine and I can pass the data through the model. Thanks a lot for your patience! :slight_smile:
