Hi! I am trying to make the data augmentations (`self.transform` and `self.transform_prime` in the code below) run on the GPU to reduce computation time. However, I am running into memory errors.
Below is a snippet of the dataset class code. `sub_img` is a NumPy array in RAM. As the code shows, I tried to convert it to a GPU torch tensor, apply the transformations to it (on the GPU), and return the results.
However, when I made a DataLoader from this dataset and ran it, CUDA ran out of memory even for a batch size of 2, which is strange, since a batch size of 35 worked when I ran the version of the code that does the augmentations on the CPU. (Also, `nvidia-smi` shows that many processes, each taking up VRAM, are created before CUDA out of memory occurs.)
```python
def __getitem__(self, idx):
    sub_img, sub_label = self.dataset[idx]  # extract the image for the subject at this idx
    if self.split == 'train':
        """below : major revision, so check again (is a copy not needed?)"""
        y1 = self.transform(from_numpy(sub_img).float().to("cuda:0"))
        y2 = self.transform_prime(from_numpy(sub_img).float().to("cuda:0"))
        return (y1, y2), sub_label
```
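For comparison, here is a minimal sketch of the conventional pattern, where `__getitem__` stays on the CPU and the whole batch is moved to the GPU in the training loop; the toy dataset and shapes are placeholders, not my actual data:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Placeholder dataset: returns CPU tensors only, so DataLoader
    worker processes never have to initialize their own CUDA contexts."""
    def __init__(self, n=8):
        self.data = [torch.randn(3, 16, 16) for _ in range(n)]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], 0  # (image, label) stay on the CPU

device = "cuda:0" if torch.cuda.is_available() else "cpu"
loader = DataLoader(ToyDataset(), batch_size=2, num_workers=0)

for imgs, labels in loader:
    imgs = imgs.to(device)  # one transfer per batch, in the main process
    # ... GPU-side augmentations / forward pass would go here ...
```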
Could anyone explain to me how I can fix this? My questions are:

- Why does CUDA run out of memory during `__getitem__`? If my understanding is correct, `__getitem__` is called when the DataLoader generates batches, so its outputs should be freed once a given batch is no longer used. Shouldn't this mean that the GPU memory used when I moved `sub_img` to `cuda:0` is released after each batch, and hence does not accumulate? Why is there a GPU memory error?
- How can I fix this? Should I make `self.transform` itself take NumPy arrays, convert them to tensors inside the function so the operations run on the GPU, and then return the tensor back to the CPU? Wouldn't this be inefficient, since the tensor has to move back and forth between the CPU and GPU?
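If the goal is GPU-side augmentation, one common workaround is to apply the transforms to the whole batch after it has been moved to the device, instead of per-sample inside `__getitem__`. A sketch, assuming the transforms are built from ordinary torch tensor ops (`gpu_augment` below is a hypothetical stand-in, not my actual transform):

```python
import torch

def gpu_augment(batch: torch.Tensor) -> torch.Tensor:
    """Hypothetical batched augmentation: random horizontal flip plus
    additive Gaussian noise, applied to the whole batch on its device."""
    if torch.rand(()) < 0.5:
        batch = torch.flip(batch, dims=[-1])  # flip along the width axis
    return batch + 0.01 * torch.randn_like(batch)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
batch = torch.randn(4, 3, 16, 16).to(device)  # stands in for a DataLoader batch

y1 = gpu_augment(batch)   # first augmented view
y2 = gpu_augment(batch)   # second view, as with self.transform_prime
```

This keeps the DataLoader workers CUDA-free while still running the augmentations on the GPU, and it only pays one host-to-device transfer per batch.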
I am sorry for my novice questions… thank you for any help and suggestions!
I have attached the error log below:
```
Traceback (most recent call last):
  File "main_3D.py", line 371, in <module>
    main()
  File "main_3D.py", line 87, in main
    torch.multiprocessing.spawn(main_worker, (args,), args.ngpus_per_node)
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/scratch/connectome/dyhan316/VAE_ADHD/barlowtwins/main_3D.py", line 151, in main_worker
    for step, ((y1, y2), _) in enumerate(loader, start=epoch * len(loader)):
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/connectome/dyhan316/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scratch/connectome/dyhan316/VAE_ADHD/barlowtwins/dataset.py", line 81, in __getitem__
    y1 = self.transform(from_numpy(sub_img).float().to("cuda:0"))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```