Hi, I have a large network that does not fit in the memory of a single GPU, so I put different layers on different GPUs. This fixed the 'out of memory' error when loading the model, but the error still appears during backpropagation.
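For context, my split looks roughly like the sketch below (the layer sizes and device IDs here are just placeholders, not my actual model):

```python
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    """Model parallelism: first half on GPU 0, second half on GPU 1."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Move activations to the next device; autograd records this transfer,
        # so loss.backward() propagates gradients back across both GPUs.
        x = self.part2(x.to("cuda:1"))
        return x
```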
Traceback (most recent call last):
  File "/scratch/project_2005641/THz_DNN/THz_Huge.py", line 379, in <module>
    Loss_cache, Lr_list = train_model()
  File "/scratch/project_2005641/THz_DNN/THz_Huge.py", line 127, in train_model
    loss.backward()  # backpropagation
  File "/usr/local/lib64/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib64/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 16.00 GiB (GPU 2; 31.75 GiB total capacity; 16.01 GiB already allocated; 14.78 GiB free; 16.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
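The message suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. If I understand the docs correctly, that would look something like this (the 128 MB value is just a guess on my part, not a recommended setting):

```python
import os
# Must be set before the first CUDA allocation, ideally before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
import torch
```

But I'm not sure fragmentation is the real problem here, since the single allocation it tries to make (16 GiB) is larger than the free memory on that GPU (14.78 GiB).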
Is there any way to fix this?