How can I use both system memory and GPU memory efficiently to speed up training?

Hi all,

I am using Colab and sometimes Kaggle instances and have noticed that the GPU memory is often quite full, while the system memory rarely goes above 4 GB and has plenty of free space.

Is there some way that I can use the system memory more efficiently?

I am reading images from disk; should I prefetch some and keep them in memory?
Is there a function I am missing?
Does prefetch do that? (I understood from GitHub and the docs that it prefetches into GPU memory, not system memory. Did I understand that correctly?)

Thank you all in advance for any tips you can offer!

The DataLoader will not move (or prefetch) data onto the GPU by default; that depends on the behavior implemented in the Dataset.__getitem__ method. Usually you would load each sample into host RAM, so the DataLoader’s workers will also prefetch these batches on the host.
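For illustration, here is a minimal sketch of a map-style Dataset whose __getitem__ loads each image into host RAM, plus a DataLoader whose workers prefetch those batches on the host (the file paths and the class name are made up):

```python
import cv2
import torch
from torch.utils.data import Dataset, DataLoader

class DiskImageDataset(Dataset):
    # Hypothetical dataset: loads one image per sample into host (CPU) RAM
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = cv2.imread(self.paths[idx])           # HxWxC uint8 array in host RAM (BGR)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        # CHW float tensor, still on the CPU
        return torch.from_numpy(img).permute(2, 0, 1).float() / 255.0

# The 4 workers call __getitem__ and queue batches in host RAM;
# nothing touches the GPU until you call .to('cuda') yourself.
loader = DataLoader(DiskImageDataset(["img0.jpg", "img1.jpg"]),
                    batch_size=2, num_workers=4)
```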
Besides that, you should note that moving data between the CPU and GPU can be quite expensive. While CPU offloading is available in PyTorch (it moves intermediate activations to the CPU to save GPU memory), this utility is meant to allow training large models that would otherwise not fit into GPU memory, which does not seem to be the case here.
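If the offloading utility meant here is torch.autograd.graph.save_on_cpu (an assumption on my part; the model and shapes below are arbitrary), a minimal sketch looks like this:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 10)
).cuda()
x = torch.randn(64, 1024, device="cuda")

# Activations saved for backward are kept in pinned host RAM instead of
# GPU memory and are copied back to the GPU during backward().
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(x).sum()
loss.backward()
```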


Thanks a lot for the explanation; I think I got it wrong the first time.
So, to make sure I understood correctly: I’m using cv2.imread (is there a benefit to using torchvision’s native read_image?) and I have set pin_memory=True, so the images I prefetch stay in RAM, queued up for processing, rather than being loaded onto the GPU?
In that case, since I am only using a little RAM, increasing the prefetch_factor, which I understand defaults to 2, could improve performance? I am going to try this, along the lines of the sketch below.
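Concretely, something like this (the TensorDataset is just a stand-in for my image dataset, and the sizes are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 3, 224, 224))  # stand-in for the image dataset

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=2,
    pin_memory=True,    # collated batches end up in page-locked host memory
    prefetch_factor=4,  # batches kept ready per worker; the default is 2,
                        # and it is only valid when num_workers > 0
)
```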

Your understanding of the loading into host RAM is correct.
Increasing the prefetch_factor won’t necessarily improve performance if the queue is already full and the training loop is not consuming the prefetched batches fast enough.


pin_memory=True allocates the batches in page-locked (pinned) host memory, which speeds up copies to the GPU. See https://pytorch.org/docs/stable/notes/cuda.html#use-pinned-memory-buffers
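As a small illustration (the tensor shape is arbitrary):

```python
import torch

x = torch.randn(1024, 1024).pin_memory()  # page-locked host allocation
y = x.to("cuda", non_blocking=True)       # this host-to-device copy can overlap with CPU work
```

With DataLoader(pin_memory=True) the same applies to each collated batch, which is why batch.to('cuda', non_blocking=True) is the usual pattern in the training loop.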
