Use of DataLoader and TensorDataset

Geeks_Sid · June 5, 2019, 12:28am

Hi all,

So I was wondering about the following two things.

What is the use of DataLoader?
What is the use of TensorDataset?

So the scenario is that my current machine has 128GB of RAM.
Is there a particular reason why I should use TensorDataset or DataLoader instead of converting my complete dataset into a single tensor?
Wouldn’t that be considerably faster and also use less I/O, in turn giving me a speedup?

Am I missing something?
Sounds like a good discussion point about PyTorch to me.

ptrblck · June 5, 2019, 10:30am

Loading the complete dataset into memory might work for you, but it certainly won’t work for a lot of use cases, e.g. when dealing with 10,000,000 images.

Also, the initial loading of the whole dataset might slow down your development iteration speed.
I.e. if you are still experimenting with your code and would like to iterate quickly, waiting several minutes for the data to load just to hit a code error might be annoying. Of course this can be avoided by loading only a subset of the data, but that would need additional code.

Other benefits, e.g. shuffling and batching, would also be missing, and you would again have to implement them manually.
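
As a rough illustration of what implementing shuffling and batching manually could look like for tensors that are already in memory, here is a minimal sketch. The helper name `iterate_minibatches` and the toy tensors are made up for this example and are not part of PyTorch.

```python
import torch

# Hypothetical in-memory dataset: 10 samples with 3 features each, plus labels.
data = torch.arange(30, dtype=torch.float32).reshape(10, 3)
targets = torch.arange(10)

def iterate_minibatches(data, targets, batch_size, shuffle=True):
    """Yield (data, target) batches from tensors that already live in memory."""
    n = data.size(0)
    # A fresh permutation on every call gives per-epoch shuffling.
    indices = torch.randperm(n) if shuffle else torch.arange(n)
    for start in range(0, n, batch_size):
        batch_idx = indices[start:start + batch_size]
        yield data[batch_idx], targets[batch_idx]

# One "epoch": every sample appears exactly once, the last batch may be smaller.
batches = list(iterate_minibatches(data, targets, batch_size=4))
```

This covers the basic case only; things like dropping the last incomplete batch or custom sampling would need further code.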

The idea behind the DataLoader is to load your data using multiprocessing (and pinned memory) and to asynchronously push each data batch onto the GPU during training, so that you can basically hide the data loading time. This is of course the optimal use case; if you are working with a slow HDD, you will most likely still notice the data loading time.
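
A minimal sketch of that setup, assuming a small random stand-in dataset: `num_workers`, `pin_memory`, and `non_blocking` are real PyTorch arguments, while the helper names `make_loader` and `run_epoch` and the chosen sizes are only illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data; a real Dataset would typically read samples from disk.
dataset = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))

def make_loader(dataset, batch_size=64, num_workers=2):
    # num_workers > 0 loads batches in background worker processes;
    # pin_memory is only enabled here when a CUDA device is present.
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=num_workers,
                      pin_memory=torch.cuda.is_available())

def run_epoch(loader, device):
    """Move every batch to `device` and return the number of samples seen."""
    seen = 0
    for x, y in loader:
        # non_blocking=True lets the copy from pinned memory overlap with compute.
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        seen += x.size(0)  # the forward/backward pass would go here
    return seen

if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(run_epoch(make_loader(dataset), device))
```

How many workers actually help depends on the storage, the preprocessing cost, and the number of CPU cores, so the value 2 is just a placeholder.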

Anyway, the choice is of course yours, as PyTorch does not depend on using a Dataset or DataLoader to work properly.

The TensorDataset is a convenient way to wrap already loaded tensors into a Dataset, e.g. so that you can take a Subset of it or wrap it in a DataLoader.
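
For example, a short sketch of wrapping in-memory tensors, taking a small subset for quick experiments, and batching it. The tensor shapes and variable names are arbitrary choices for illustration.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Hypothetical data that has already been loaded into memory.
features = torch.randn(100, 8)
labels = torch.randint(0, 2, (100,))

# TensorDataset indexes all wrapped tensors along their first dimension.
dataset = TensorDataset(features, labels)
x0, y0 = dataset[0]

# Subset keeps only the given indices, e.g. for a quick debugging run.
debug_subset = Subset(dataset, range(10))

# DataLoader adds batching and shuffling on top.
loader = DataLoader(debug_subset, batch_size=4, shuffle=True)
batch_shapes = [tuple(x.shape) for x, _ in loader]
```

With in-memory tensors like these, worker processes are often unnecessary, so the loader here keeps the default `num_workers=0`.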

Geeks_Sid · June 5, 2019, 12:36pm

Yes, but let’s say my training set is 7GB and I iterate over it for 200 epochs. This means I will have read 1.4TB of data from my hard disk or SSD. If I instead load it into RAM once, taking 7~10GB (considering overheads), would I not save myself that hassle? Also, batch size and shuffling can be handled manually.
My final question would then become:
‘Is this more energy efficient and will it give me a speedup?’
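
The 1.4TB figure above can be reproduced with a quick calculation. This sketch assumes that every epoch re-reads the full dataset from storage with no caching in between, which is the scenario described in the post and not a measured result.

```python
# Numbers taken from the post; decimal units (1 TB = 1000 GB) are an assumption.
dataset_gb = 7
epochs = 200

# Reading from disk each epoch: the full dataset is read once per epoch.
gb_read_from_disk = dataset_gb * epochs
tb_read_from_disk = gb_read_from_disk / 1000

# Loading into RAM up front: the dataset is read from disk a single time.
gb_read_in_memory = dataset_gb

# How many times more data is read from storage in the per-epoch case.
read_ratio = gb_read_from_disk / gb_read_in_memory

print(f"per-epoch reads: {tb_read_from_disk} TB, one-time load: {gb_read_in_memory} GB")
```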