Although most datasets can be sliced, this one cannot:
digit_12_to_14 = mnist_ds[12:15]
will return
ValueError: Too many dimensions: 3 > 2.
This is due to the Image.fromarray() call in __getitem__().
Is it possible to use the MNIST dataset without using a DataLoader? If so, how?
PS: The reason I would like to avoid the DataLoader is that sending batches to the GPU one at a time slows down training. I prefer to send the whole dataset to the GPU just once. For this I need access to the whole TRANSFORMED dataset.
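As a minimal sketch of a workaround for the slicing error: when __getitem__ only accepts integers, the items of a range can be fetched one by one and stacked. The IntIndexedDataset class below is a hypothetical stand-in with random data, not the torchvision MNIST class, so the snippet runs without downloading anything.

```python
import torch
from torch.utils.data import Dataset


class IntIndexedDataset(Dataset):
    """Hypothetical stand-in for a dataset whose __getitem__ accepts only integers."""

    def __init__(self, n=20):
        self.images = torch.rand(n, 1, 16, 16)  # random placeholder "images"
        self.labels = torch.arange(n) % 10      # placeholder labels 0..9

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("only integer indices are supported")
        return self.images[index], int(self.labels[index])


ds = IntIndexedDataset()

# Fetch items 12, 13 and 14 individually instead of slicing with ds[12:15].
samples = [ds[i] for i in range(12, 15)]
images = torch.stack([img for img, _ in samples])
labels = torch.tensor([label for _, label in samples])
```

The same pattern should apply to any map-style dataset that returns (tensor, int) pairs, at the cost of one __getitem__ call per item.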
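from time import time
import torch
from torch.utils.data import DataLoader

# mnist_ds is the torchvision MNIST dataset (with its transforms) from the question.

# First approach: index every item and build the tensors by hand.
print("\nFirst...")
st = time()
x_all_ts = torch.tensor([mnist_ds[i][0].numpy() for i in range(0, len(mnist_ds))])
t_all_ts = torch.tensor([mnist_ds[i][1] for i in range(0, len(mnist_ds))])
print(f"{time()-st} images:{x_all_ts.size()} targets:{t_all_ts.size()} ")

# Second approach: let a DataLoader collate the whole dataset into a single batch.
print("\nSecond...")
st = time()
dl = DataLoader(dataset=mnist_ds, batch_size=len(mnist_ds))
X, T = list(dl)[0]
print(f"{time()-st} images:{X.size()} targets:{T.size()} ")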
First...
42.57414126396179 images:torch.Size([60000, 1, 16, 16]) targets:torch.Size([60000])
Second...
17.22612690925598 images:torch.Size([60000, 1, 16, 16]) targets:torch.Size([60000])
By the way, it makes the training much faster to put MNIST on the GPU just once and manually create batches in the training loop.
I am looking for a transparent way to do this in PyTorch, that is, a way to put the dataset on the GPU before training and still be able to use the DataLoader in the training loop.
It should work if you push all the data in your Dataset onto the GPU and use num_workers=0 as well as pin_memory=False in your DataLoader.
You could write your own Dataset, or, if you would like to use the torchvision.datasets.MNIST one, you could manipulate the underlying dataset.data and dataset.targets.
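A minimal sketch of that suggestion, assuming the images have already been transformed into tensors: the random tensors below are placeholders for the real data, and TensorDataset is used here as one convenient wrapper rather than the only option.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder tensors standing in for the already transformed images and labels.
images = torch.rand(1000, 1, 16, 16)
labels = torch.randint(0, 10, (1000,))

# Move everything to the device once, before training starts.
gpu_ds = TensorDataset(images.to(device), labels.to(device))

# Worker processes and pinned memory only help with host-side data,
# so both are switched off here.
loader = DataLoader(gpu_ds, batch_size=100, shuffle=True,
                    num_workers=0, pin_memory=False)

for x, t in loader:
    # x and t are already on `device`; no per-batch transfer is needed.
    pass
```

Shuffling and batching still go through the DataLoader, so the training loop does not need to change.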
I guess you meant dataset.train_data and dataset.train_labels
Using dataset.train_labels is a great idea. It’s exactly what I wanted.
But as you know, dataset.train_data is not transformed. MNIST images are transformed in __getitem__(), and though I could develop my own dataset, I like the simplicity of torchvision MNIST.
So my updated solution is:
x_all_ts = torch.tensor([mnist_ds[i][0].numpy() for i in range(0, len(mnist_ds))])
t_all_ts = mnist_ds.train_labels
Regarding your first sentence, I am a bit confused. How do I push the MNIST dataset onto the GPU? Did you mean pushing dataset.train_data?
Yeah, I’ve built torchvision from source, where these names were modified (source).
You are right, if you need to transform the PIL.Images, my suggestion won’t work.
I had in mind to push the already transformed tensors onto the GPU.
However, I’m not even sure you’ll really see a performance benefit compared to pushing the data onto the GPU in your training loop using multiple workers.
On my tiny laptop GPU, in a training loop, pushing each batch onto the GPU one at a time with the default number of workers (I guess 1) takes 2 to 3 times longer than pushing the whole dataset once and then manually creating batches in the training loop. Same thing on a Google Colab GPU.
I will try to play with multiple workers and pin_memory on Google Colab.
I agree with you. I tried pushing my whole validation set onto the GPU once, and then pushing each batch (batch size 100) onto the GPU one at a time. The result confirms what you said: the former is indeed faster than the latter.
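For reference, the "push once, then batch manually" variant described above could look roughly like this. It is only a sketch: the tensors are random placeholders rather than MNIST, and the forward and backward pass is left out.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder data, moved to the device a single time.
images = torch.rand(1000, 1, 16, 16).to(device)
labels = torch.randint(0, 10, (1000,)).to(device)

batch_size = 100
n = images.size(0)
n_batches = 0

for epoch in range(2):
    # A fresh permutation per epoch replaces the DataLoader's shuffling.
    perm = torch.randperm(n, device=device)
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        x, t = images[idx], labels[idx]  # indexing happens on the device
        # The forward pass, loss and optimizer step would go here.
        n_batches += 1
```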
While this approach might be faster for tiny datasets and small models, you are using your precious GPU memory to store all the data.
In my opinion the standard workflow is to save the GPU memory for your (large) model and to only push the current batch of data as needed. If tutorials teach you to push all data onto the GPU I would consider that an edge case (small data, small model).
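A sketch of that standard workflow, again with random placeholder tensors instead of MNIST. The num_workers value of 0 is an assumption chosen to keep the snippet portable; in practice it would typically be raised when per-sample transforms are expensive.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_cuda = torch.cuda.is_available()

# The dataset stays in host memory; only the current batch goes to the GPU.
cpu_ds = TensorDataset(torch.rand(1000, 1, 16, 16),
                       torch.randint(0, 10, (1000,)))

loader = DataLoader(
    cpu_ds,
    batch_size=100,
    shuffle=True,
    num_workers=0,        # assumption: kept at 0 here for portability
    pin_memory=use_cuda,  # pinned host memory enables asynchronous copies
)

seen = 0
for x, t in loader:
    # non_blocking=True lets the copy overlap with compute when memory is pinned.
    x = x.to(device, non_blocking=True)
    t = t.to(device, non_blocking=True)
    # The forward pass, loss and optimizer step would go here.
    seen += x.size(0)
```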
(Tutorials I read teach NOT to push all data onto the GPU)
I understand things better now. In fact, you convinced me to give up my approach and adopt best practices right now (even on MNIST or CIFAR10). Thanks a million for your explanation and time.