I’m implementing a policy gradient agent that plays Atari Pong.
The main loop at the moment is:
- Run the policy to gather some experience and save the experience (images, actions, rewards) in a dataset.
- Run a single iteration of training over the gathered experience to update the policy.
- Throw all the data away and start again.
Up until now, I have just been training on the CPU, but now I’d like to move the training to the GPU.
When it comes to loading, it seems I have a matrix of choices.
- Create tensors on CPU, then push them to GPU via pinned memory.
- Create tensors directly on GPU.
1. Create tensors in the get_item(index) of the DataSet
2. Create tensors in the collate_batch function of the DataLoader (or write my own DataLoader)
I thought I understood all this, but as it turns out, there are a few gaps in my understanding. I have 3 questions.
Is there any downside to directly creating GPU tensors in the get_item(index) call on the DataSet? Is this a bad idea?
What’s the best practice for creating tensors generally? Should they be created for each item in the dataset, or should the dataset just return what it returns and the loader take care of creating tensors?
Where is the right place to decide which device a tensor goes to? The DataSet, the DataLoader, or the training loop itself?
Any insight appreciated!
The downside of creating GPU tensors in __getitem__, or pushing CPU tensors onto the device there, is that your DataLoader won’t be able to use multiple workers anymore.
If you try to set num_workers > 0, you’ll get a CUDA error:
RuntimeError: CUDA error: initialization error
This also means that your host and device operations (most likely) won’t be able to overlap anymore, i.e. your CPUs cannot load and process the data while your GPU is busy training the model.
If you are “creating” the tensors, i.e. sampling them, this could still be a valid approach.
However, if you are loading and processing some data (e.g. images), I would write the Dataset such that a single example is loaded, processed and returned as a CPU tensor.
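A minimal sketch of such a Dataset, assuming the gathered experience is held as NumPy arrays in host memory. The class name `ExperienceDataset`, the array shapes and the scaling to [0, 1] are illustrative assumptions, not something taken from the original code:

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class ExperienceDataset(Dataset):
    """Hypothetical dataset over gathered (frame, action, reward) experience."""

    def __init__(self, frames, actions, rewards):
        # Plain NumPy arrays kept in host memory; no CUDA calls anywhere here.
        self.frames = frames
        self.actions = actions
        self.rewards = rewards

    def __len__(self):
        return len(self.frames)

    def __getitem__(self, index):
        # Load and process a single example, then return it as CPU tensors.
        frame = torch.from_numpy(self.frames[index]).float() / 255.0
        action = torch.tensor(self.actions[index], dtype=torch.long)
        reward = torch.tensor(self.rewards[index], dtype=torch.float32)
        return frame, action, reward


# Random data standing in for real Pong frames (shapes are assumptions).
frames = np.random.randint(0, 256, size=(8, 1, 80, 80), dtype=np.uint8)
actions = np.random.randint(0, 2, size=8)
rewards = np.random.randn(8).astype(np.float32)

dataset = ExperienceDataset(frames, actions, rewards)
frame, action, reward = dataset[0]
```

Because nothing in `__getitem__` touches the GPU, a Dataset written this way stays usable with `num_workers > 0`.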
If you use pin_memory=True in your DataLoader, the transfer from host to device will be faster, as described in this blogpost.
Inside the training loop you would push the tensors onto the GPU. If you set non_blocking=True as an argument in tensor.to(), PyTorch will try to perform the transfer asynchronously, as described here.
The DataLoader might use a sampler or a custom collate_fn, but shouldn’t be responsible for creating the tensors.
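Putting these pieces together, a rough sketch of the loading side could look like the following. The tiny linear model, the cross-entropy loss and the random `TensorDataset` are placeholders chosen only so the snippet runs on its own; they are not a policy gradient implementation. The device falls back to the CPU when no GPU is available:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Stand-in CPU data: 32 single-channel 80x80 frames with binary actions.
dataset = TensorDataset(torch.rand(32, 1, 80, 80), torch.randint(0, 2, (32,)))

# The DataLoader only batches CPU tensors. Pinned memory is requested only
# when a GPU is present. num_workers is left at its default of 0 to keep the
# sketch portable; in practice you would raise it.
loader = DataLoader(dataset, batch_size=8, shuffle=True, pin_memory=use_cuda)

# Placeholder model and loss, not an actual policy gradient objective.
model = nn.Sequential(nn.Flatten(), nn.Linear(80 * 80, 2)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

num_batches = 0
for frames, actions in loader:
    # The device is chosen here, in the training loop. With pinned memory,
    # non_blocking=True lets the copy overlap with other work.
    frames = frames.to(device, non_blocking=True)
    actions = actions.to(device, non_blocking=True)

    optimizer.zero_grad()
    loss = criterion(model(frames), actions)
    loss.backward()
    optimizer.step()
    num_batches += 1
```

Keeping the device decision in the training loop means the Dataset and DataLoader stay device-agnostic, so the same code runs on a CPU-only machine.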
That’s incredibly helpful (as usual). Thanks Peter!
@ptrblck I had a question related to this. Actually I am loading image filenames in my dataset, and loading the images in __getitem__. I also use manually written data augmentation functions in __getitem__, so wouldn’t it be better to have these run on the GPU? Currently training is pretty slow.
Now I think I have 2 options - either I write the data augmentation functions for batched tensors outside the DataLoader, or I can create GPU tensors in
What would you recommend?
What kind of data augmentation methods are you currently using?
If you are using PIL, you might want to install PIL-SIMD (Pillow-SIMD), which is a drop-in replacement and will use SIMD operations to speed up the transformations.
Are you using multiple workers in your DataLoader? This should usually speed up the data loading and processing.
If you are bottlenecked by the CPU and have enough GPU resources, you might also try to use NVIDIA/DALI. @JanuszL might give you some more information about it. Have a look at his post here for some additional information.
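If you do want to try the first option from the question (augmentation functions written for batched tensors and applied outside the DataLoader), a minimal sketch might look like this. The helper `random_horizontal_flip` and the batch shape are hypothetical examples, and the flip is only one possible augmentation:

```python
import torch


def random_horizontal_flip(batch, p=0.5):
    """Flip each image in an (N, C, H, W) batch left-right with probability p."""
    # One random decision per image, created on the same device as the batch.
    flip_mask = torch.rand(batch.size(0), device=batch.device) < p
    flipped = torch.flip(batch, dims=[3])
    # Broadcast the per-image mask over the channel, height and width dims.
    return torch.where(flip_mask.view(-1, 1, 1, 1), flipped, batch)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A CPU batch as it would come out of the DataLoader (shape is an assumption).
images = torch.rand(4, 3, 32, 32)
images = images.to(device, non_blocking=True)

# The augmentation runs on whatever device the batch already lives on.
augmented = random_horizontal_flip(images)
```

Since the function only uses tensor operations, the DataLoader can keep returning CPU tensors with multiple workers, and the augmentation is applied once per batch after the transfer.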