I am used to using NumPy arrays in the form X, y and fitting a model to those. I can’t understand what Datasets and DataLoaders do to the X and y vectors. I have searched the internet a fair amount and I still cannot figure out what those classes do.
I am hoping someone on here can give me a simple, quick explanation of what they do and what they are for. Here’s an example of how I use them:
You can use the plain tensors as X_train and y_train, if you are able to load them completely (and push to the GPU without sacrificing too much memory).
The Dataset is an abstraction that lets you load and process each sample of your dataset lazily, while the DataLoader takes care of shuffling, sampling (including weighted sampling), batching, using multiprocessing to load the data, using pinned memory, etc. This tutorial might be helpful to see the advantages of using this approach.
That being said, you are of course fine to use the tensors directly, which might also be faster if you are using a tiny dataset.
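To make the split between the two concrete, here is a minimal sketch rather than a definitive implementation. The class name ArrayDataset and the random X and y arrays are made up for illustration; only Dataset and DataLoader come from torch.utils.data. The Dataset hands out one sample at a time, and the DataLoader does the shuffling and batching on top of it.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class ArrayDataset(Dataset):
    """Hypothetical example dataset: returns one (x, y) sample per index."""

    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        # Number of samples; the DataLoader uses this to build its index list.
        return len(self.X)

    def __getitem__(self, idx):
        # Per-sample work happens here, only when the sample is requested.
        # In a real project this is where you might read a file or augment.
        x = torch.from_numpy(self.X[idx]).float()
        y = torch.tensor(self.y[idx], dtype=torch.long)
        return x, y


# Assumed toy data: 100 samples, 8 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8)).astype(np.float32)
y = rng.integers(0, 2, size=100)

dataset = ArrayDataset(X, y)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Each iteration yields one shuffled mini-batch of stacked samples.
batch_shapes = [(xb.shape, yb.shape) for xb, yb in loader]
```

With 100 samples and batch_size=16 the loader yields six full batches and one final batch of four. Passing drop_last=True would discard that short batch, and num_workers greater than 0 would move the __getitem__ calls into worker processes.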
Hey thank you so much ptrblck. This is very helpful to my understanding! I feel it is good for me to understand what these functions do, so it is not like some black box that I am too scared to touch!
I see the difference now, and I will use a DataLoader now for the helpful reasons you mentioned.
Think of them as two layers of abstraction in the PyTorch data pipeline:
TensorDataset: Implements the __getitem__ and __len__ protocols for a set of tensors. It ensures your input data and targets stay aligned.
DataLoader: An iterable that abstracts away the complexity of batching, shuffling, and memory pinning (pin_memory=True).
Pro-tip: If you’re moving from NumPy/Scikit-learn, TensorDataset is the closest equivalent of your (X, y) tuple: indexing it returns the matching (x, y) pair, which is the interface a DataLoader expects.
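For someone coming from NumPy arrays, the two layers might look like the sketch below. The toy arrays are assumptions chosen so the alignment is easy to check (the first column of row i is 2*i and the label is i); TensorDataset, DataLoader, and pin_memory are the real torch.utils.data names.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Assumed toy data: row i of X is [2*i, 2*i + 1], and its label is i.
X = np.arange(20, dtype=np.float32).reshape(10, 2)
y = np.arange(10, dtype=np.int64)

# Layer 1: wrap the tensors so that indexing returns aligned (x, y) pairs.
dataset = TensorDataset(torch.from_numpy(X), torch.from_numpy(y))
x3, y3 = dataset[3]  # the fourth row of X together with its label

# Layer 2: batching and shuffling; pinning only matters when a GPU is present.
loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    pin_memory=torch.cuda.is_available(),
)

for xb, yb in loader:
    # Even after shuffling, every input row still sits next to its own label.
    assert torch.equal(xb[:, 0], 2.0 * yb.float())
```

Since torch.from_numpy shares memory with the original array, no copy is made when wrapping the data this way.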