How to create series of training examples from a custom dataset?

torc · March 18, 2021, 11:07am

For datasets that come integrated with PyTorch, this is very easy.

In the case of MNIST, this is enough:

from torchvision import datasets

train_data = datasets.MNIST('.', train=True, download=True)

x_train, y_train = train_data.data, train_data.targets

But how do I create such series from custom datasets that I create using classes subclassing from torch.utils.data.Dataset?

I have a dataset having a series of tuples of tensors and labels. The tensors are shaped 180*180 and the labels are integers.

>>> dataset[0]
(tensor([[1.5628, 1.5679, 1.5588,  ..., 1.6395, 1.6355, 1.6354],
         [1.5106, 1.5402, 1.5627,  ..., 1.5813, 1.6235, 1.6520],
         [1.5924, 1.6069, 1.5967,  ..., 1.5813, 1.5924, 1.5964],
         ...,
         [1.5945, 1.6138, 1.6241,  ..., 1.6181, 1.6243, 1.6018],
         [1.6006, 1.6283, 1.6591,  ..., 1.6047, 1.6047, 1.6161],
         [1.6181, 1.6181, 1.6129,  ..., 1.5833, 1.5679, 1.6110]]),
 5)

How do I go from here to creating, say, x of torch.Size([5000, 180, 180]) and y of torch.Size([5000])?

I then want to create a TensorDataset from these, and finally form a DataLoader from that.

This is a newbie question, and I cannot find the answer online.

ptrblck · March 19, 2021, 6:21am

I’m not completely sure I understand the question correctly, so feel free to update the description a bit and correct me, if I’m misunderstanding it.
In case you want to access the complete dataset and are preloading it in the __init__ or are passing the data directly to the dataset, you should be able to access the internal attributes via e.g. dataset.data.
On the other hand, if you want to create batches of 5000 samples, you can pass this dataset to a DataLoader with a batch_size=5000, which should then return the desired shapes.
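
A minimal sketch of both options, for illustration only. ToyDataset is a hypothetical stand-in for the custom dataset in the question, the data is random, and the attribute names data and targets are assumptions borrowed from the MNIST example. It uses 50 samples instead of 5000 to keep memory use small.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical stand-in for the custom dataset in the question."""

    def __init__(self, num_samples=50):
        # Preload everything in __init__ so the full tensors stay accessible.
        self.data = torch.randn(num_samples, 180, 180)
        self.targets = torch.randint(0, 10, (num_samples,))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return a (tensor, int) tuple, matching dataset[0] in the question.
        return self.data[idx], int(self.targets[idx])


dataset = ToyDataset()

# Option 1: read the preloaded attributes directly.
x_attr, y_attr = dataset.data, dataset.targets

# Option 2: request a single batch that covers the whole dataset. The default
# collate function stacks the samples and converts the int labels to a tensor.
loader = DataLoader(dataset, batch_size=len(dataset))
x, y = next(iter(loader))

print(x.shape, y.shape)
```

With shuffling left off, the single batch should contain the samples in dataset order, so x and y match the preloaded attributes. Option 2 also works when the dataset loads samples lazily and has no such attributes.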

torc · March 22, 2021, 6:07am

I actually thought that I should create a TensorDataset to create DataLoader from it.

But, turns out, I can create a DataLoader directly from the dataset.
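
For anyone landing here later, a rough sketch of what that looks like, assuming a dataset that wraps a list of (tensor, int) tuples. TupleDataset, the random data, and the batch size of 16 are all made up for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TupleDataset(Dataset):
    """Hypothetical dataset that wraps a list of (tensor, int) tuples."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


# Random placeholder data: 40 samples of shape 180x180 with integer labels.
samples = [(torch.randn(180, 180), i % 10) for i in range(40)]
dataset = TupleDataset(samples)

# No TensorDataset needed: the DataLoader batches the custom dataset itself.
loader = DataLoader(dataset, batch_size=16, shuffle=True)

batch_shapes = []
for x_batch, y_batch in loader:
    # x_batch is a float tensor of images, y_batch an integer tensor of labels.
    batch_shapes.append((tuple(x_batch.shape), tuple(y_batch.shape)))
```

Any object that implements __len__ and __getitem__ can be passed to DataLoader this way. The last batch is smaller here because 40 is not a multiple of 16.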