No, the DataLoader
will load each sample via Dataset.__getitem__
and use the collate_fn
to create a batch out of these samples. It has no knowledge of whether these tensors are on the CPU or the GPU.
Your current approach of moving the data to the GPU in your collate_fn
should have the same effect as moving the data to the device inside the training loop, since each batch (not the complete dataset) will be pushed to the device.
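To illustrate, here is a minimal sketch of both approaches; the dataset, tensor shapes, and the `collate_to_device` name are made up for the example:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder setup for illustration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataset = TensorDataset(torch.randn(8, 3), torch.randint(0, 2, (8,)))

# Approach 1: move each collated batch to the device inside collate_fn
def collate_to_device(samples):
    data = torch.stack([s[0] for s in samples])
    target = torch.stack([s[1] for s in samples])
    return data.to(device), target.to(device)

loader = DataLoader(dataset, batch_size=4, collate_fn=collate_to_device)

# Approach 2: default collate, move each batch inside the training loop
plain_loader = DataLoader(dataset, batch_size=4)
for data, target in plain_loader:
    data, target = data.to(device), target.to(device)
    # forward/backward pass here
```

Note that if you use `num_workers > 0`, the collate_fn runs in worker processes, so moving tensors to a CUDA device there can cause issues; in that case moving the batch inside the training loop is the safer option.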
For general information about data loading bottlenecks, I would recommend having a look at this post.