Should I turn off `pin_memory` when I already loaded the image to the GPU in `__getitem__`?

I had code that used pin_memory=True and performed image transformations in the __getitem__ of the dataset.

I modified __getitem__ so that the transformations are done on the GPU. As expected, running with pin_memory=True gave me the following error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/scratch/connectome/dyhan316/VAE_ADHD/barlowtwins/main_3D.py", line 151, in main_worker
    for step, ((y1, y2), _) in enumerate(loader, start=epoch * len(loader)):
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 34, in _pin_memory_loop
    data = pin_memory(data)
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 65, in pin_memory
    return type(data)([pin_memory(sample) for sample in data])  # type: ignore[call-arg]
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 65, in <listcomp>
    return type(data)([pin_memory(sample) for sample in data])  # type: ignore[call-arg]
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 65, in pin_memory
    return type(data)([pin_memory(sample) for sample in data])  # type: ignore[call-arg]
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 65, in <listcomp>
    return type(data)([pin_memory(sample) for sample in data])  # type: ignore[call-arg]
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 50, in pin_memory
    return data.pin_memory()
RuntimeError: cannot pin 'torch.cuda.FloatTensor' only dense CPU tensors can be pinned

In this case, am I right in assuming that I should remove the pin_memory = True option, since the image is already loaded to the GPU? Thank you for your insight!

Yes, you cannot pin CUDA tensors, as pinned memory is located on the host to allow for faster and asynchronous data transfers, as also mentioned in the error message.
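For illustration, here is a minimal sketch (the dataset and tensor shapes are made up) of the two setups: if __getitem__ already returns CUDA tensors, drop pin_memory (and keep num_workers=0, since returning CUDA tensors from worker processes causes its own problems); if __getitem__ returns CPU tensors, keep pin_memory=True and move batches with non_blocking=True:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class GpuDataset(Dataset):
    """Hypothetical dataset whose __getitem__ already returns CUDA tensors."""

    def __init__(self, images):
        self.images = images  # CPU tensor of images

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # The transformations happen on the GPU, so the sample is a CUDA tensor.
        return self.images[idx].cuda()


images = torch.randn(8, 3, 4, 4)

# Option A: samples are already on the GPU -> pin_memory=False and num_workers=0
# (returning CUDA tensors from multiple worker processes causes further issues).
gpu_loader = DataLoader(GpuDataset(images), batch_size=4,
                        num_workers=0, pin_memory=False)

# Option B: keep __getitem__ on the CPU, pin the host memory in the loader,
# and copy each batch asynchronously.
cpu_loader = DataLoader(images, batch_size=4, num_workers=2, pin_memory=True)
for batch in cpu_loader:
    batch = batch.to("cuda", non_blocking=True)  # async copy from pinned memory
```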

Thank you! I have another related question, if you don’t mind: I found that even after data loading is done and the actual training is running, the GPU RAM taken up by the data-loader workers does not go away. Is this normal behavior? I wanted to increase the number of workers to speed up data loading, but the RAM taken up by the data loader puts constraints on my batch size.

Also, what is the normal design paradigm for doing data loading on GPUs when using DDP? Should I move all my dataloader workers to one GPU and not use that GPU for training (only for data loading), or should I assign a certain number of workers to each GPU (which would mean a smaller batch size per GPU)?

Would the second option be more optimal, since less data would have to move back and forth between the GPUs?

pin_memory=True allocates page-locked (pinned) memory in host RAM, which makes transfers between the host and the GPU faster and allows them to run asynchronously.
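Roughly speaking (this is a sketch, not the DataLoader internals), a pinned tensor is a page-locked copy on the host, and copying it to the GPU with non_blocking=True can overlap with computation:

```python
import torch

x = torch.randn(1024, 1024)                 # regular, pageable CPU tensor
x_pinned = x.pin_memory()                   # page-locked copy in host RAM
print(x.is_pinned(), x_pinned.is_pinned())  # False True

if torch.cuda.is_available():
    # non_blocking=True can only overlap the copy with compute when the
    # source tensor lives in pinned host memory.
    y = x_pinned.to("cuda", non_blocking=True)

    # Calling pin_memory() on a CUDA tensor reproduces the RuntimeError
    # from the traceback above (at least on the PyTorch version shown there).
    try:
        y.pin_memory()
    except RuntimeError as e:
        print(e)
```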

GPU RAM will typically remain allocated until training is completed: PyTorch's caching allocator holds on to memory it has already reserved so it can reuse it for the rest of training, rather than returning it to the driver.
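If you want to check whether that memory is held by live tensors or merely cached by the allocator, something like the following sketch can help:

```python
import torch

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    print(torch.cuda.memory_allocated(device))  # bytes held by live tensors
    print(torch.cuda.memory_reserved(device))   # bytes reserved by the caching allocator
    torch.cuda.empty_cache()                    # returns cached (unused) blocks to the driver
    print(torch.cuda.memory_summary(device, abbreviated=True))
```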

During DDP, a number of separate processes (equal to the number of GPUs, i.e. devices) are created. Each process holds its own copy of the same model, and the data is divided among these GPUs/models for training. After the backward pass, the gradients are synchronized across the copies, so every replica stays identical while each one only processes its own share of the data. This is what makes training faster.
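As a rough sketch of that setup (the model and dataset here are placeholders), each rank runs one process on one GPU, and a DistributedSampler hands it a disjoint shard of the data:

```python
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main_worker(rank, world_size):
    # One process per GPU; each process owns exactly one device.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(10, 2).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
    sampler = DistributedSampler(dataset)  # each rank gets a disjoint shard
    loader = DataLoader(dataset, batch_size=8, sampler=sampler, pin_memory=True)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards every epoch
        for x, y in loader:
            x = x.cuda(rank, non_blocking=True)
            y = y.cuda(rank, non_blocking=True)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()   # gradients are all-reduced across ranks here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    world_size = torch.cuda.device_count()
    torch.multiprocessing.spawn(main_worker, args=(world_size,), nprocs=world_size)
```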

@tiramisuNcustard Thank you for the detailed reply!!

I have a question, if you don’t mind: each augmentation worker takes about 2–3 GB of RAM, which seems too large given that I only perform Gaussian blurring and cutout… Do you believe this could be due to a memory leak?

@Kore_ana I did not write PyTorch itself and have not seen the source code of PyTorch’s Gaussian blurring and cutout functions, so I cannot answer your question.

What I do know is the following: current software development practice puts more emphasis on performance (quick execution time) than on RAM usage. That means that, as long as RAM is available, it will be used to make applications faster; there is very little value in keeping RAM unused at the cost of a slower application.

Perhaps @ptrblck can give an accurate and detailed answer about PyTorch’s memory management.

You might be re-initializing a CUDA context in each subprocess, which could take a lot of memory if you are using CUDA <= 11.6, or CUDA >= 11.7 with lazy context loading disabled.
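For reference, lazy module loading on CUDA >= 11.7 is controlled by the CUDA_MODULE_LOADING environment variable, which has to be set before the CUDA context is created; a small sketch (the exact savings depend on your setup):

```python
import os

# Must be set before the CUDA context is created (i.e. before the first
# .cuda() call), ideally before the script even starts:
#   CUDA_MODULE_LOADING=LAZY python main_3D.py
os.environ.setdefault("CUDA_MODULE_LOADING", "LAZY")

import torch

print(torch.version.cuda)                     # lazy loading needs CUDA >= 11.7
print(os.environ.get("CUDA_MODULE_LOADING"))  # "LAZY" if the flag is in effect
```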

I see! Thank you for the response! I’ll look into lazy context loading and related settings.