Pin memory vs sending direct to GPU from dataset

DuaneNielsen · January 5, 2019, 10:03pm

I’m implementing an atari pong playing policy gradient agent.

The main loop at the moment is:

1. Run the policy to gather some experience and save the experience (images, actions, rewards) in a dataset.
2. Run a single iteration of training over the gathered experience to update the policy.
3. Throw all the data away and start again.

Up until now, I have just been using cpu training, but now I’d like to push the training to GPU.

When it comes to loading, it seems I have a matrix of choices.

1. Create tensors on CPU, then push them to GPU via pinned memory.
2. Create tensors directly on GPU.

1. Create tensors in the __getitem__(index) method of the Dataset.
2. Create tensors in the collate_fn of the DataLoader (or write my own DataLoader).

I thought I understood all this, but as it turns out, there are a few gaps in my understanding. I have 3 questions.

Is there any downside to creating GPU tensors directly in the __getitem__(index) call of the Dataset? Is this a bad idea?

What’s the best practice for creating tensors generally? Should they be created for each item in the dataset, or should the dataset just return what it returns, and the loader take care of creating tensors?

Where is the right place to decide which device a tensor goes to: the Dataset, the DataLoader, or the training loop itself?

Any insight appreciated!

ptrblck · January 5, 2019, 10:20pm

The downside of creating GPU tensors in __getitem__, or pushing CPU tensors onto the device there, is that your DataLoader won’t be able to use multiple workers anymore.
If you try to set num_workers > 0, you’ll get a CUDA error:

RuntimeError: CUDA error: initialization error

This also means that your host and device operations (most likely) won’t be able to overlap anymore, i.e. your CPUs cannot load and process the data while your GPU is busy training the model.

If you are “creating” the tensors, i.e. sampling them, this could still be a valid approach.
However, if you are loading and processing some data (e.g. images), I would write the Dataset such that a single example is loaded, processed and returned as a CPU tensor.
If you use pin_memory=True in your DataLoader, the transfer from host to device will be faster as described in this blogpost.
Inside the training loop you would push the tensors onto the GPU. If you set non_blocking=True as an argument in tensor.to(), PyTorch will try to perform the transfer asynchronously as described here.
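
A minimal sketch of that pattern, for illustration only. The names ExperienceDataset and train_one_pass are hypothetical, the batch size is arbitrary, and the loss is an assumed REINFORCE-style objective rather than anything from the original post. The Dataset only returns CPU tensors, the DataLoader pins memory when a GPU is present, and the device is chosen in the training loop:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ExperienceDataset(Dataset):
    """Hypothetical dataset holding (image, action, reward) experience on the CPU."""

    def __init__(self, images, actions, rewards):
        self.images, self.actions, self.rewards = images, actions, rewards

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        # Return plain CPU tensors; no .cuda() or .to(device) in here,
        # so the dataset stays usable from DataLoader worker processes.
        return self.images[index], self.actions[index], self.rewards[index]


def train_one_pass(dataset, policy, optimizer, device, num_workers=0):
    """Run one pass over the dataset and return the number of batches seen."""
    loader = DataLoader(
        dataset,
        batch_size=32,
        shuffle=True,
        num_workers=num_workers,
        # Pinned (page-locked) host memory only matters when copying to a GPU.
        pin_memory=(device.type == "cuda"),
    )
    batches = 0
    for images, actions, rewards in loader:
        # The training loop decides the device; non_blocking=True lets the
        # copy from pinned memory run asynchronously where possible.
        images = images.to(device, non_blocking=True)
        actions = actions.to(device, non_blocking=True)
        rewards = rewards.to(device, non_blocking=True)

        # Assumed REINFORCE-style loss: -log pi(a|s) * reward.
        log_probs = torch.log_softmax(policy(images), dim=1)
        chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = -(chosen * rewards).mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        batches += 1
    return batches
```

With num_workers > 0 the loading runs in worker processes; on platforms that spawn rather than fork workers, the call would need to sit under an `if __name__ == "__main__":` guard.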

The DataLoader might use a sampler or a custom collate_fn, but shouldn’t be responsible for creating the tensors.
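
As a rough illustration of that split, here is a sketch with a hypothetical EpisodeDataset and pad_collate (neither is from the thread). The per-sample tensors are created in __getitem__, and the collate_fn only combines the samples it is handed, here by padding variable-length sequences:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class EpisodeDataset(Dataset):
    """Hypothetical dataset: each item is a variable-length reward sequence."""

    def __init__(self, episodes):
        self.episodes = episodes  # list of Python lists of floats

    def __len__(self):
        return len(self.episodes)

    def __getitem__(self, index):
        # The sample tensor is created here, on the CPU.
        return torch.tensor(self.episodes[index], dtype=torch.float32)


def pad_collate(batch):
    """Combine already-created sample tensors into one padded batch."""
    # Small bookkeeping tensor recording each sequence's true length.
    lengths = torch.tensor([len(seq) for seq in batch])
    # Zero-pad to the longest sequence in the batch: shape (N, max_len).
    padded = pad_sequence(batch, batch_first=True)
    return padded, lengths


def make_loader(episodes, batch_size=2):
    return DataLoader(EpisodeDataset(episodes), batch_size=batch_size,
                      collate_fn=pad_collate)
```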

DuaneNielsen · January 5, 2019, 10:30pm

That’s incredibly helpful (as usual). Thanks Peter!

raunaks · July 17, 2019, 5:32pm

@ptrblck I had a question related to this. Actually I am loading image filenames in my dataset, and loading the images in __getitem__. I also use manually written data augmentation functions in __getitem__, so wouldn’t it be better to have these run on the GPU? Currently training is pretty slow.

Now I think I have 2 options - either I write the data augmentation functions for batched tensors outside the DataLoader, or I can create GPU tensors in __getitem__.
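
For reference, the first option could look roughly like the sketch below. The function augment_batch and its two augmentations (random horizontal flip and a per-sample brightness shift) are hypothetical stand-ins, not the augmentations used in this post. The function works on a whole (N, C, H, W) batch on whatever device the batch already lives on, so it can be called in the training loop after the batch has been moved to the GPU:

```python
import torch


def augment_batch(images, flip_prob=0.5, max_brightness_shift=0.1):
    """Hypothetical batched augmentation for float images in [0, 1], shape (N, C, H, W)."""
    n = images.size(0)

    # One coin flip per sample, created on the same device as the batch.
    flip = torch.rand(n, device=images.device) < flip_prob
    flipped = torch.flip(images, dims=[3])  # mirror along the width axis
    images = torch.where(flip.view(n, 1, 1, 1), flipped, images)

    # One brightness offset per sample in [-max_shift, +max_shift].
    shift = (torch.rand(n, 1, 1, 1, device=images.device) * 2 - 1) * max_brightness_shift
    return (images + shift).clamp(0.0, 1.0)
```

In the training loop this would be applied after the transfer, e.g. `images = augment_batch(images.to(device, non_blocking=True))`, which leaves the DataLoader free to keep using multiple workers.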

What would you recommend?

ptrblck · July 18, 2019, 11:26am

What kind of data augmentation methods are you currently using?
If you are using PIL, you might want to install Pillow-SIMD, which is a drop-in replacement and will use SIMD operations to speed up the transformations.
Are you using multiple workers in your DataLoader? This should usually speed up the data loading and processing.

If you are bottlenecked by the CPU and have enough GPU resources, you might also try to use NVIDIA/DALI. @JanuszL might give you some more information about it. Have a look at his post here for some additional information.