Unable to allocate CUDA memory, when there is enough cached memory

If you are sure that you don’t need the process, you could try to kill it, but please make sure it isn’t a process your system still needs.

torch.cuda.empty_cache() shouldn’t help, as it would only empty the CUDA memory cache, which would then trigger expensive cudaMalloc calls and would thus slow down your code.
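For illustration, a minimal sketch (arbitrary tensor sizes; `memory_reserved()` is called `memory_cached()` in older PyTorch releases) showing what the caching allocator holds and what `empty_cache()` actually releases:

```python
import torch

# Arbitrary allocation just to populate the caching allocator.
x = torch.randn(1024, 1024, device="cuda")                          # ~4 MiB tensor
print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated")      # held by tensors
print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved")        # held by the cache

del x  # the tensor is freed, but the allocator keeps the block in its cache
print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated after del")
print(torch.cuda.memory_reserved() / 1024**2, "MiB still reserved")

torch.cuda.empty_cache()  # hands cached blocks back to the driver, so the
                          # next allocation needs a fresh (slow) cudaMalloc
print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved after empty_cache")
```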

It just stops the code that I’m running, right? It won’t change my code or delete my original dataset.

Could you please explain in detail how to release the GPU memory to avoid this issue before I run a new project, e.g. via nvidia-smi?

It depends on the process you are stopping. If the GPU is used to visualize your desktop, this process might be needed, unless you are working on a server etc. Killing a process (especially with -9) might result in data loss, as the process might not have a chance to save its work, so you should be careful with it.

nvidia-smi will show you the used memory on the device (and the processes using the memory, if possible). If you don’t need these processes, you could close them to save memory, but that depends on your system and on which processes they are.
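If you prefer to do this from Python instead of the terminal, a small sketch (assuming nvidia-smi is available on the PATH) that lists the compute processes and their memory usage:

```python
import subprocess

# Query the processes currently holding GPU memory via nvidia-smi.
# Fields: process id, process name, used GPU memory.
out = subprocess.check_output(
    ["nvidia-smi",
     "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv,noheader"],
    universal_newlines=True)

for line in out.strip().splitlines():
    print(line)  # e.g. "12345, python, 2345 MiB"
```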

Hi, I got a similar problem: after a fresh restart of the PC I was only able to allocate 3 GB of my 8 GB NVIDIA GPU.

So for me it worked to remove everything from the Windows autorun, e.g. Steam, Java, etc.

Also, having an Ubuntu PC would probably work well :smiley: I’ve never had these kinds of problems with my Ubuntu PC.

I solved this problem by increasing the batch size.

@ptrblck @smth, I am working with two 3090 GPUs, but I still don’t know why it’s showing an OOM error, and nvidia-smi shows this!

This is the exact error: RuntimeError: CUDA out of memory. Tried to allocate 4.00 MiB (GPU 0; 23.70 GiB total capacity; 18.06 GiB already allocated; 5.56 MiB free; 838.00 KiB cached)

By the way, this is what is shown when the training terminated with the RuntimeError.

Could you post a minimal, executable code snippet which would reproduce the issue, i.e. running out of memory while the GPU still has enough free memory, please?

Sorry, I cannot share the exact code, but somewhere in torch.autograd I had used retain_graph=True; will that affect it?

@ptrblck, please check.


In spite of the process being terminated, nearly 4 GB of memory on each GPU is still occupied.

Yes, it can affect it, as you might be increasing the memory usage in each iteration by keeping the computation graph.
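A minimal sketch (hypothetical model and loop, not the original training code) of this pattern: keeping a reference to the accumulated loss keeps every iteration’s graph and its saved activations alive, and retain_graph=True is what allows backward to keep running through those old graphs instead of freeing them.

```python
import torch
import torch.nn as nn

# Hypothetical two-layer model; the intermediate activation is saved in the
# graph for the backward pass.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).cuda()
x = torch.randn(64, 1024, device="cuda")

total_loss = 0.
for i in range(100):
    out = model(x)
    # Accumulating the (attached) loss keeps every previous iteration's graph
    # and its saved activations alive; retain_graph=True then allows backward
    # to be called repeatedly through them, so allocated memory grows per step.
    total_loss = total_loss + out.mean()
    total_loss.backward(retain_graph=True)
    print("iter {}: {:.1f} MB allocated".format(
        i, torch.cuda.memory_allocated() / 1024**2))
```

Detaching or summing the plain loss values (e.g. `total_loss += loss.item()`) instead of the attached tensors would avoid this growth.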

@ptrblck, but without using retain_graph I was getting None for the grad of some variables.

Btw, it ran OOM before even finishing one iteration.

I restarted the training after killing all PIDs which were occupying GPU memory, but it didn’t help.

Experiment dir : search-EXP-ab1-20211023-041336
10/23 04:13:36 AM gpu device = 0,1
10/23 04:13:36 AM args = Namespace(arch_learning_rate=0.0003, arch_weight_decay=0.001, batch_size=8, cutout=False, cutout_length=16, data='/voyager-volume/code_1_test/Original_images', drop_path_prob=0.3, epochs=50, gpu='0,1', grad_clip=5, init_channels=16, is_parallel=1, layers=8, learning_rate=0.025, learning_rate_feature_extractor=0.025, learning_rate_head_g=0.025, learning_rate_min=0.001, model_path='saved_models', momentum=0.9, num_classes=31, report_freq=50, save='search-EXP-ab1-20211023-041336', seed=2, source='amazon', target='dslr', train_portion=0.5, unrolled=False, weight_decay=0.0003, weight_decay_fe=0.0003, weight_decay_hg=0.0003)

10/23 04:55:44 AM param size = 0.297522MB
10/23 04:55:44 AM epoch 0 lr 2.500000e-02
10/23 04:55:44 AM genotype = Genotype(normal=[('max_pool_3x3', 1), ('skip_connect', 0), ('max_pool_3x3', 0), ('skip_connect', 2), ('dil_conv_5x5', 2), ('dil_conv_5x5', 0), ('sep_conv_3x3', 1), ('dil_conv_3x3', 4)], normal_concat=range(2, 6), reduce=[('avg_pool_3x3', 1), ('sep_conv_3x3', 0), ('dil_conv_3x3', 1), ('dil_conv_3x3', 0), ('skip_connect', 2), ('max_pool_3x3', 1), ('max_pool_3x3', 3), ('dil_conv_3x3', 2)], reduce_concat=range(2, 6))
tensor([[0.1250, 0.1250, 0.1247, 0.1251, 0.1250, 0.1251, 0.1250, 0.1251],
[0.1250, 0.1252, 0.1249, 0.1250, 0.1251, 0.1251, 0.1248, 0.1249],
[0.1249, 0.1253, 0.1250, 0.1249, 0.1249, 0.1250, 0.1250, 0.1249],
[0.1251, 0.1249, 0.1250, 0.1251, 0.1250, 0.1249, 0.1250, 0.1251],
[0.1249, 0.1247, 0.1250, 0.1252, 0.1249, 0.1251, 0.1251, 0.1250],
[0.1249, 0.1250, 0.1250, 0.1251, 0.1250, 0.1250, 0.1248, 0.1253],
[0.1251, 0.1250, 0.1250, 0.1251, 0.1248, 0.1251, 0.1250, 0.1249],
[0.1252, 0.1250, 0.1249, 0.1249, 0.1250, 0.1249, 0.1251, 0.1251],
[0.1250, 0.1250, 0.1251, 0.1251, 0.1251, 0.1250, 0.1249, 0.1248],
[0.1250, 0.1252, 0.1249, 0.1250, 0.1251, 0.1249, 0.1248, 0.1251],
[0.1249, 0.1249, 0.1250, 0.1250, 0.1252, 0.1250, 0.1250, 0.1250],
[0.1251, 0.1249, 0.1249, 0.1250, 0.1249, 0.1252, 0.1251, 0.1251],
[0.1250, 0.1251, 0.1251, 0.1250, 0.1250, 0.1251, 0.1249, 0.1249],
[0.1251, 0.1247, 0.1249, 0.1251, 0.1252, 0.1249, 0.1253, 0.1249]],
device='cuda:0', grad_fn=)
tensor([[0.1252, 0.1251, 0.1250, 0.1249, 0.1251, 0.1249, 0.1249, 0.1250],
[0.1249, 0.1248, 0.1251, 0.1250, 0.1250, 0.1251, 0.1251, 0.1250],
[0.1251, 0.1249, 0.1249, 0.1251, 0.1249, 0.1250, 0.1251, 0.1250],
[0.1251, 0.1249, 0.1249, 0.1250, 0.1250, 0.1249, 0.1251, 0.1251],
[0.1249, 0.1251, 0.1248, 0.1250, 0.1250, 0.1250, 0.1251, 0.1251],
[0.1251, 0.1248, 0.1251, 0.1251, 0.1250, 0.1249, 0.1250, 0.1250],
[0.1250, 0.1251, 0.1250, 0.1251, 0.1249, 0.1249, 0.1250, 0.1249],
[0.1249, 0.1251, 0.1251, 0.1252, 0.1249, 0.1251, 0.1248, 0.1248],
[0.1252, 0.1250, 0.1250, 0.1251, 0.1247, 0.1249, 0.1252, 0.1250],
[0.1251, 0.1247, 0.1250, 0.1251, 0.1249, 0.1251, 0.1250, 0.1252],
[0.1250, 0.1249, 0.1249, 0.1251, 0.1250, 0.1252, 0.1250, 0.1249],
[0.1249, 0.1251, 0.1249, 0.1250, 0.1251, 0.1250, 0.1251, 0.1249],
[0.1250, 0.1252, 0.1247, 0.1247, 0.1249, 0.1252, 0.1250, 0.1251],
[0.1250, 0.1252, 0.1250, 0.1251, 0.1250, 0.1249, 0.1248, 0.1250]],
device='cuda:0', grad_fn=)
/opt/conda/lib/python3.6/site-packages/torch/tensor.py:292: UserWarning: non-inplace resize_as is deprecated
warnings.warn("non-inplace resize_as is deprecated")
Traceback (most recent call last):
File "train.py", line 363, in <module>
main()
File "train.py", line 181, in main
train_acc, train_obj = train(source_train_loader,source_val_loader,target_train_loader,target_val_loader, criterion,optimizer,optimizer_fe, optimizer_hg,lr,feature_extractor,head_g,model,architect,args.batch_size)
File "train.py", line 219, in train
_,domain_logits=model(input_img_source)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/voyager-volume/code_1_test/code_1_test/code_1_test/code_1_test/code_1_test/model_search.py", line 159, in forward
s0, s1 = s1, cell(s0, s1, weights,weights2)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/voyager-volume/code_1_test/code_1_test/code_1_test/code_1_test/code_1_test/model_search.py", line 85, in forward
s = sum(weights2[offset+j]*self._ops[offset+j](h, weights[offset+j]) for j, h in enumerate(states))
File "/voyager-volume/code_1_test/code_1_test/code_1_test/code_1_test/code_1_test/model_search.py", line 85, in <genexpr>
s = sum(weights2[offset+j]*self._ops[offset+j](h, weights[offset+j]) for j, h in enumerate(states))
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/voyager-volume/code_1_test/code_1_test/code_1_test/code_1_test/code_1_test/model_search.py", line 44, in forward
temp1 = sum(w * op(xtemp) for w, op in zip(weights, self._ops))
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 23.70 GiB total capacity; 22.83 GiB already allocated; 2.56 MiB free; 523.00 KiB cached)

As the error message explains, you have already allocated almost all of the GPU memory and won’t be able to allocate more.
Without a code snippet I’m not able to debug this any further.
Since you cannot share the code, try to check the memory usage at different places in your script and narrow down which parts of the code allocate this (apparently unexpected) memory.
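A minimal sketch of the kind of checks I mean (a dummy model stands in for the real training script; `memory_allocated()` and `max_memory_allocated()` are available in older PyTorch releases as well):

```python
import torch
import torch.nn as nn

def report(tag):
    # Print the currently allocated and the peak allocated memory on the default device.
    print("{:<22s} allocated {:7.1f} MiB | peak {:7.1f} MiB".format(
        tag,
        torch.cuda.memory_allocated() / 1024**2,
        torch.cuda.max_memory_allocated() / 1024**2))

# Dummy model and input standing in for the real training script.
model = nn.Linear(4096, 4096).cuda()
report("after model.cuda()")

x = torch.randn(256, 4096, device="cuda")
out = model(x)
report("after forward")

out.mean().backward()
report("after backward")
```

Sprinkling such calls after the forward pass, the backward pass, and the optimizer step of your actual training loop should show in which step the unexpected memory is allocated.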