Pin memory vs sending direct to GPU from dataset

The downside of creating GPU tensors in __getitem__ (or pushing CPU tensors onto the device there) is that your DataLoader won’t be able to use multiple workers anymore.
If you try to set num_workers > 0, you’ll get a CUDA error:

```
RuntimeError: CUDA error: initialization error
```
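For illustration, here is a minimal sketch of the failing pattern (the dataset name, shapes, and sizes are arbitrary placeholders). With the default fork start method on Linux, the CUDA context cannot be re-initialized inside the forked worker processes, so touching CUDA in __getitem__ raises the error:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class CudaDataset(Dataset):
    """Anti-pattern: creates CUDA tensors inside __getitem__."""
    def __len__(self):
        return 100

    def __getitem__(self, idx):
        # Initializing CUDA in a forked worker process fails
        return torch.randn(3, 224, 224, device='cuda')

loader = DataLoader(CudaDataset(), batch_size=8, num_workers=2)
next(iter(loader))  # RuntimeError: CUDA error: initialization error
```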

This also means that your host and device operations (most likely) won’t be able to overlap anymore, i.e. your CPU cannot load and process the next batch while your GPU is busy training the model.

If you are “creating” the tensors on the fly, i.e. sampling random data directly on the device, this could still be a valid approach.
However, if you are loading and processing some data (e.g. images), I would write the Dataset such that a single example is loaded, processed, and returned as a CPU tensor.
If you use pin_memory=True in your DataLoader, the transfer from host to device will be faster, as described in this blog post.
Inside the training loop you would then push the tensors onto the GPU. If you set non_blocking=True as an argument in tensor.to(), PyTorch will try to perform the transfer asynchronously, as described here.
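Putting it together, this is a minimal sketch of that setup (MyDataset, the shapes, and the batch sizes are placeholder assumptions, not your actual code):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    """Loads/processes a single example and returns CPU tensors."""
    def __init__(self, data, targets):
        self.data = data          # e.g. file paths or preloaded CPU tensors
        self.targets = targets

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x = self.data[idx]        # load and process on the CPU here
        y = self.targets[idx]
        return x, y

dataset = MyDataset(torch.randn(1000, 3, 224, 224),
                    torch.randint(0, 10, (1000,)))
loader = DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True)

device = torch.device('cuda')
for data, target in loader:
    # Pinned host memory allows the non-blocking copy to run asynchronously
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)
    # ... forward / backward / optimizer step ...
```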

The DataLoader might use a sampler or a custom collate_fn, but it shouldn’t be responsible for creating the tensors on the device; see the sketch below.
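For example, a custom collate_fn would only batch the CPU tensors, while the device transfer still lives in the training loop (collate_cpu is a hypothetical name, and dataset reuses the setup from the sketch above):

```python
import torch
from torch.utils.data import DataLoader

def collate_cpu(batch):
    """Stacks individual CPU samples into CPU batch tensors."""
    data = torch.stack([sample[0] for sample in batch])
    targets = torch.stack([sample[1] for sample in batch])
    return data, targets

loader = DataLoader(dataset, batch_size=32, num_workers=4,
                    pin_memory=True, collate_fn=collate_cpu)
```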
