What do TensorDataset and DataLoader do?

Jude_Capachietti · December 24, 2020, 3:07am

I am used to using numpy arrays in the form X,y and fitting a model to those. I can’t understand what Datasets and Dataloaders do to the X and y vectors. I have searched on the internet a fair amount and I still cannot figure out what those functions do.

I am hoping someone on here can give me a simple, quick explanation of what these functions do and what they are for. Here’s an example of how I use them:

trainset = torch.utils.data.TensorDataset(X_train, y_train)

trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, drop_last=True)

Why do I do these to load data into a Neural Network, instead of just using X_train, y_train?

ptrblck · January 5, 2021, 9:02am

You can use the plain tensors as X_train and y_train, if you are able to load them completely (and push to the GPU without sacrificing too much memory).
The Dataset is an abstraction that lets you load and process each sample of your dataset lazily, while the DataLoader takes care of shuffling, sampling (including weighted sampling), batching, loading the data with multiple worker processes, using pinned memory, etc.
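As a rough sketch of what that means for the snippet in the question (the random data and its shapes are made up here purely for illustration): TensorDataset just pairs the tensors up by index, and DataLoader iterates over those pairs in shuffled batches.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Made-up stand-in data: 1000 samples, 20 features, binary labels.
X_train = torch.randn(1000, 20)
y_train = torch.randint(0, 2, (1000,))

# TensorDataset indexes all tensors together:
# trainset[i] returns the tuple (X_train[i], y_train[i]).
trainset = TensorDataset(X_train, y_train)
x0, y0 = trainset[0]

# DataLoader draws a fresh shuffled order each epoch and
# stacks the individual samples into batches of 128.
trainloader = DataLoader(trainset, batch_size=128, shuffle=True, drop_last=True)

# One pass over the loader is one epoch.
batch_shapes = [(xb.shape, yb.shape) for xb, yb in trainloader]

# 1000 // 128 = 7 full batches; drop_last=True discards the
# remaining 104 samples for this epoch.
print(len(trainloader), batch_shapes[0])
```

Inside a training loop you would replace the list comprehension with `for xb, yb in trainloader:` and run the forward and backward pass on each batch.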
This tutorial might be helpful to see the advantages of using this approach.
That being said, you are of course fine to use the tensors directly, which might also be faster if you are using a tiny dataset.
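For comparison, here is a minimal sketch of batching the plain tensors by hand, assuming everything fits in memory. The helper name `iterate_batches` and the random data are hypothetical, not part of PyTorch; the helper mimics `shuffle=True` and `drop_last=True` with a random permutation.

```python
import torch

# Made-up stand-in data: 1000 samples, 20 features, binary labels.
X_train = torch.randn(1000, 20)
y_train = torch.randint(0, 2, (1000,))

def iterate_batches(X, y, batch_size):
    """Yield shuffled (X, y) batches, dropping the last incomplete one."""
    perm = torch.randperm(X.size(0))                # new random order per call
    n_full = (X.size(0) // batch_size) * batch_size  # samples in full batches
    for start in range(0, n_full, batch_size):
        idx = perm[start:start + batch_size]
        yield X[idx], y[idx]                         # index both tensors alike

batches = list(iterate_batches(X_train, y_train, batch_size=128))
print(len(batches), batches[0][0].shape)
```

This skips the per-sample indexing and collation the DataLoader does, which is where the speedup on tiny datasets would come from, but you give up worker processes, custom samplers, and pinned memory.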

Jude_Capachietti · January 11, 2021, 2:42pm

Hey thank you so much ptrblck. This is very helpful to my understanding! I feel it is good for me to understand what these functions do, so it is not like some black box that I am too scared to touch!

I see the difference now, and I will use a DataLoader now for the helpful reasons you mentioned.

All the best!