Cannot slice the torchvision MNIST dataset

In PyTorch, when using the torchvision MNIST dataset, we can get a single digit as follows:

    import torchvision
    import torchvision.transforms as transforms
    from torch.utils.data import DataLoader, Dataset, TensorDataset
    
    tsfm = transforms.Compose([transforms.Resize((16, 16)),
                               transforms.ToTensor(),
                               transforms.Normalize((0.1307,), (0.3081,))])
    
    mnist_ds = torchvision.datasets.MNIST(root='../../../_data/mnist', train=True,
                                          download=True, transform=tsfm)
    
    
    digit_12 = mnist_ds[12]

Though it is possible to slice most datasets, we cannot slice this one:

    digit_12_to_14 = mnist_ds[12:15]

raises

    ValueError: Too many dimensions: 3 > 2.

This is due to an Image.fromarray() call in __getitem__().
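
Just to illustrate the cause (a sketch only; depending on the torchvision version, the raw images are stored in mnist_ds.data or mnist_ds.train_data):

    from PIL import Image

    # A single index gives a 2-D (H, W) uint8 array -> Image.fromarray is happy
    img_ok = Image.fromarray(mnist_ds.data[12].numpy(), mode='L')

    # A slice gives a 3-D (3, H, W) array -> "ValueError: Too many dimensions: 3 > 2."
    # img_bad = Image.fromarray(mnist_ds.data[12:15].numpy(), mode='L')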

Is it possible to use the MNIST dataset without a DataLoader? How?

PS: The reason I would like to avoid the DataLoader is that sending batches to the GPU one at a time slows down the training. I would rather send the whole data to the GPU just once. For that I need access to the whole TRANSFORMED dataset.

Would a workaround of setting the batch_size to the size of the complete dataset solve this issue?

It did. I came up with 2 solutions, in fact.

    import torch
    from time import time
    # (DataLoader and mnist_ds come from the snippet above)

    print("\nFirst...")
    st = time()
    # Build the full transformed dataset by indexing each item individually
    x_all_ts = torch.tensor([mnist_ds[i][0].numpy() for i in range(len(mnist_ds))])
    t_all_ts = torch.tensor([mnist_ds[i][1] for i in range(len(mnist_ds))])
    print(f"{time()-st}   images:{x_all_ts.size()}  targets:{t_all_ts.size()}")

    print("\nSecond...")
    st = time()
    # Load everything in a single batch through a DataLoader
    dl = DataLoader(dataset=mnist_ds, batch_size=len(mnist_ds))
    X, T = list(dl)[0]
    print(f"{time()-st}   images:{X.size()}  targets:{T.size()}")

    First...
    42.57414126396179   images:torch.Size([60000, 1, 16, 16])  targets:torch.Size([60000])
    Second...
    17.22612690925598   images:torch.Size([60000, 1, 16, 16])  targets:torch.Size([60000])

By the way, it makes the training much faster to put MNIST on the GPU just once and manually create batches in the training loop.
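
Roughly, what I do is something like this (just a sketch; the batch size and epoch count are made up, and X / T come from the second solution above, assuming a CUDA device is available):

    device = torch.device('cuda')
    X_gpu, T_gpu = X.to(device), T.to(device)   # one-time transfer of the whole dataset

    bs = 128                                     # hypothetical batch size
    n = X_gpu.size(0)
    for epoch in range(5):                       # hypothetical number of epochs
        perm = torch.randperm(n, device=device)  # reshuffle every epoch
        for i in range(0, n, bs):
            idx = perm[i:i + bs]
            x_batch, t_batch = X_gpu[idx], T_gpu[idx]
            # ... forward pass, loss, backward, optimizer step ...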

I am looking forward to a transparent way to do this from PyTorch, that is, a way to put the dataset on the GPU before training and still be able to use the DataLoader in the training loop.

It should work if you push all the data in your Dataset onto the GPU and use num_workers=0 as well as pin_memory=False in your DataLoader.
You could write your own Dataset, or if you would like to use the torchvision.datasets.MNIST one, you could manipulate the underlying dataset.data and dataset.targets.
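
Something along these lines might work (just a sketch, reusing the already transformed x_all_ts / t_all_ts tensors and the TensorDataset import from your snippets above):

    device = torch.device('cuda')
    # Wrap the already transformed tensors, pushed to the GPU, in a TensorDataset
    gpu_ds = TensorDataset(x_all_ts.to(device), t_all_ts.to(device))
    gpu_dl = DataLoader(gpu_ds, batch_size=128, shuffle=True,
                        num_workers=0, pin_memory=False)

    for x_batch, t_batch in gpu_dl:
        ...  # batches are already on the GPU, no per-batch host-to-device copy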


I guess you meant dataset.train_data and dataset.train_labels.
Using dataset.train_labels is a great idea. It's exactly what I wanted.
But as you know, dataset.train_data is not transformed. MNIST images are transformed in __getitem__(), and though I could write my own Dataset, I like the simplicity of torchvision MNIST.
So my updated solution is:

    x_all_ts = torch.tensor([mnist_ds[i][0].numpy() for i in range(len(mnist_ds))])
    t_all_ts = mnist_ds.train_labels

Regarding your first sentence, I am a bit confused. How do I push the MNIST dataset onto the GPU? Did you mean push dataset.train_data?

Yeah, I’ve built torchvision from source, where these names were modified (source).
You are right, if you need to transform the PIL.Images, my suggestion won’t work.
I had in mind to push the already transformed tensors onto the GPU.

However, I'm not even sure you'll really see a performance benefit compared to pushing the data onto the GPU in your training loop using multiple workers.


On my tiny laptop GPU, in a training loop, pushing each batch onto the GPU one at a time with the default number of workers (I guess 1) takes 2 or 3 times longer than pushing the whole dataset once and then manually creating batches in the training loop. Same thing on a Google Colab GPU.

I will try to play with multiple workers and pin_memory on Google Colab.

Thanks for your help. Take care.

I agree with you. I tried pushing my whole validation set onto the GPU at once, and also pushing each batch (batch size is 100) one at a time onto the GPU; the result shows that what you said is correct, the former is indeed faster than the latter.

and most tutorials I read use the slow DataLoader solution, whereas using something like:

    for i in range((n - 1) // bs + 1):
        x = x_ts[i * bs: i * bs + bs]    # x_ts is already on the GPU

makes it much faster (with one worker of course).

While this approach might be faster for tiny datasets and small models, you are using your precious GPU memory to store all the data.

In my opinion the standard workflow is to save the GPU memory for your (large) model and to only push the current batch of data as needed. If tutorials teach you to push all data onto the GPU, I would consider that an edge case (small data, small model).
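
For reference, the standard per-batch pattern looks roughly like this (only a sketch; the DataLoader settings and the training step are placeholders):

    device = torch.device('cuda')
    train_dl = DataLoader(mnist_ds, batch_size=128, shuffle=True,
                          num_workers=2, pin_memory=True)

    for x_batch, t_batch in train_dl:
        # Move only the current batch onto the GPU
        x_batch = x_batch.to(device, non_blocking=True)
        t_batch = t_batch.to(device, non_blocking=True)
        # ... forward pass, loss, backward, optimizer step ...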


(Tutorials I read teach NOT to push all data onto the GPU)

I understand things better now. In fact, you convinced me to give up my approach and adopt best practices right now (even on MNIST or CIFAR10). Thanks a million for your explanation and time.
