Loading the complete dataset into memory might work for you, but it certainly won't work for a lot of use cases, e.g. when dealing with 10,000,000 images.
Also, the initial loading of the whole dataset might slow down your iteration speed.
E.g. if you are still experimenting with your code and would like to iterate quickly, waiting several minutes for the data to load just to hit a code error is annoying. Of course this can be avoided by loading only a subset of the data, but that would require additional code.
Other benefits, e.g. shuffling and batching, would also be missing, and you would again have to implement them manually.
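To give an idea of what you would have to reimplement: a minimal sketch of manual shuffling and batching on tensors that are already in memory (the shapes are made up for illustration) could look like this:

```python
import torch

# Toy in-memory "dataset" (shapes chosen purely for illustration)
data = torch.randn(100, 3)
target = torch.randint(0, 2, (100,))

batch_size = 10

# Manual shuffling: permute the sample indices once per epoch
perm = torch.randperm(data.size(0))
data, target = data[perm], target[perm]

# Manual batching: slice the tensors in chunks of batch_size
for i in range(0, data.size(0), batch_size):
    x = data[i:i + batch_size]
    y = target[i:i + batch_size]
    # training step would go here
```

The DataLoader handles all of this (plus edge cases such as a smaller last batch via drop_last) for you.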
The idea behind the DataLoader is to load your data using multiprocessing (and pinned memory) and to asynchronously push the data batches onto the GPU during training, so that the data loading time is mostly hidden. This is of course the optimal use case, and if you are working with a slow HDD, you will most likely still notice the data loading time. Anyway, the choice is of course yours, as PyTorch does not depend on DataLoader usage to work properly.
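As a sketch of this setup: a custom Dataset (the class and its lazy loading in __getitem__ are hypothetical placeholders) combined with num_workers, pin_memory, and non_blocking copies:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    # Hypothetical dataset that loads each sample lazily in __getitem__
    def __init__(self, length=100):
        self.length = length

    def __getitem__(self, index):
        # In a real use case you would e.g. load an image from disk here
        x = torch.randn(3)
        y = torch.randint(0, 2, (1,))
        return x, y

    def __len__(self):
        return self.length

dataset = MyDataset()
loader = DataLoader(
    dataset,
    batch_size=10,
    shuffle=True,
    num_workers=2,    # load batches in background worker processes
    pin_memory=True,  # page-locked host memory enables async copies
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for x, y in loader:
    # With pin_memory=True, non_blocking=True can overlap the
    # host-to-device copy with computation on the GPU
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    # training step would go here
```

The overlap between loading and training is what hides the data loading time in the optimal case.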
TensorDataset is a convenient method to wrap already loaded tensors into a Dataset, so that you can e.g. create a Subset from it or wrap it in a DataLoader.
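Putting these together, a short sketch (the split indices are arbitrary, just for illustration):

```python
import torch
from torch.utils.data import TensorDataset, Subset, DataLoader

# Tensors that are already loaded in memory (random here for illustration)
data = torch.randn(100, 3)
target = torch.randint(0, 2, (100,))

# TensorDataset wraps the tensors; indexing returns one (data, target) pair
dataset = TensorDataset(data, target)

# Subset restricts the dataset to the given indices, e.g. a validation split
val_dataset = Subset(dataset, indices=list(range(80, 100)))

# Both can be passed directly to a DataLoader
loader = DataLoader(val_dataset, batch_size=5, shuffle=False)
for x, y in loader:
    pass  # validation step would go here
```

This way you get shuffling, batching, and multiprocessing for free even though the tensors were already loaded into memory.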