CUDA out of memory when doing GPU augmentations in `__getitem__` of custom dataset

Hi! I am trying to run my data augmentations (self.transform and self.transform_prime in the code below) on the GPU to reduce computation time. However, I am running into memory errors.

Below is a snippet of the dataset class code.

In the snippet, sub_img is a NumPy array held in RAM. As the code shows, I convert it to a torch tensor on the GPU, apply the transformations there, and return the results.

However, when I built a DataLoader from this dataset and ran it, I got a CUDA out of memory error even with a batch size of 2. This is strange, because a batch size of 35 worked with the version of the code that does the augmentations on the CPU. Also, nvidia-smi shows many processes being created, each taking up VRAM, before the CUDA out of memory error occurs.

    def __getitem__(self,idx):
        sub_data = self.dataset[idx]
        sub_img, sub_label = self.dataset[idx]  # fetch the image for the subject at this idx
        if self.split == 'train':
            # below: major revision, so check again (is it OK not to copy?)
            y1 = self.transform(from_numpy(sub_img).float().to("cuda:0"))
            y2 = self.transform_prime(from_numpy(sub_img).float().to("cuda:0"))
            return (y1, y2), sub_label

Could anyone explain how I can fix this? My questions are:

  • Why does CUDA run out of memory inside __getitem__? As I understand it, __getitem__ is called when the DataLoader builds a batch, so whatever it allocates should be released once that batch is no longer in use. Shouldn’t the GPU memory used when I move sub_img to cuda:0 be freed after each batch, and therefore not accumulate? Why is there a GPU memory error?
  • How can I fix this? Should self.transform take NumPy arrays, convert them to tensors internally to run the operations on the GPU, and then return the result to the CPU? Wouldn’t that be inefficient, since the tensor has to move back and forth between the CPU and GPU?
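For the second question, one arrangement that is often suggested is to keep __getitem__ on the CPU and apply the augmentations to the whole batch on the GPU in the training process, so that DataLoader workers never touch CUDA. Below is a minimal sketch of that idea, not a drop-in fix. CpuOnlyDataset, gpu_augment, and two_views are hypothetical names, the flip-plus-noise augmentation is only a placeholder for self.transform / self.transform_prime, and the array shapes are made up.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class CpuOnlyDataset(Dataset):
    """Hypothetical stand-in for the dataset above: returns CPU tensors only."""

    def __init__(self, images, labels):
        self.images = images  # list of NumPy arrays kept in RAM
        self.labels = labels

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # No .to("cuda") here, so worker processes never initialise CUDA.
        return torch.from_numpy(self.images[idx]).float(), self.labels[idx]


def gpu_augment(batch, noise_std):
    # Placeholder augmentation (random flip + Gaussian noise), assumed here in
    # place of self.transform / self.transform_prime. It runs on whatever
    # device `batch` lives on. The flip decision is shared by the whole batch.
    if torch.rand(1).item() < 0.5:
        batch = torch.flip(batch, dims=[-1])
    return batch + noise_std * torch.randn_like(batch)


def two_views(loader, device):
    # Move each collated batch to the device once, then build both views there.
    for x, labels in loader:
        x = x.to(device, non_blocking=True)
        yield (gpu_augment(x, 0.1), gpu_augment(x, 0.05)), labels


# Falls back to the CPU so the sketch also runs on a machine without a GPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
images = [np.random.rand(1, 8, 8).astype(np.float32) for _ in range(6)]
labels = [0, 1, 0, 1, 0, 1]
loader = DataLoader(
    CpuOnlyDataset(images, labels),
    batch_size=2,
    num_workers=0,  # can be raised; the workers still only produce CPU tensors
    pin_memory=torch.cuda.is_available(),
)

for (y1, y2), batch_labels in two_views(loader, device):
    print(y1.shape, y2.shape, y1.device, batch_labels)
```

With this layout there is one host-to-device copy per batch and nothing is copied back to the CPU, which avoids the back-and-forth transfer asked about above.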

Sorry for the novice questions, and thank you for any help or suggestions :)

I have attached the error log below:

Traceback (most recent call last):
  File "", line 371, in <module>
  File "", line 87, in main
    torch.multiprocessing.spawn(main_worker, (args,), args.ngpus_per_node)
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/multiprocessing/", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/multiprocessing/", line 198, in start_processes
    while not context.join():
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/multiprocessing/", line 160, in join
    raise ProcessRaisedException(msg, error_index,

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/multiprocessing/", line 69, in _wrap
    fn(i, *args)
  File "/scratch/connectome/dyhan316/VAE_ADHD/barlowtwins/", line 151, in main_worker
    for step, ((y1, y2), _) in enumerate(loader, start=epoch * len(loader)):
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/", line 530, in __next__
    data = self._next_data()
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/", line 1224, in _next_data
    return self._process_data(data)
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/", line 1250, in _process_data
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/", line 457, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/_utils/", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/_utils/", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/_utils/", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scratch/connectome/dyhan316/VAE_ADHD/barlowtwins/", line 81, in __getitem__
    y1 = self.transform(from_numpy(sub_img).float().to("cuda:0"))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
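
The traceback shows the error being raised in DataLoader worker process 0, which is itself a child of spawned process 1. Every process that initialises CUDA creates its own CUDA context, and each context reserves some GPU memory before any tensor is allocated, which would be consistent with the many processes seen in nvidia-smi. A rough sketch for counting those processes follows. It assumes that each spawned training process uses its own GPU and starts num_workers DataLoader workers, and that every worker calls .to("cuda:0") as in the snippet; the function names and the example values are hypothetical.

```python
def cuda_process_count(ngpus_per_node, num_workers):
    """Total processes that initialise CUDA when __getitem__ moves data to the GPU."""
    training_procs = ngpus_per_node              # one per torch.multiprocessing.spawn rank
    worker_procs = ngpus_per_node * num_workers  # each rank starts its own DataLoader workers
    return training_procs + worker_procs


def procs_on_cuda0(ngpus_per_node, num_workers):
    """Processes holding a context on cuda:0, assuming rank i trains on GPU i."""
    # Rank 0 itself, plus every worker of every rank, because the snippet
    # hard-codes "cuda:0" inside __getitem__.
    return 1 + ngpus_per_node * num_workers


# Hypothetical values: 2 GPUs per node, 8 DataLoader workers per rank.
print(cuda_process_count(2, 8))  # 18
print(procs_on_cuda0(2, 8))      # 17
```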