I am working on a large, complicated PyTorch model that solves PDEs using torch geometric graphs, among other things. I run the computations on an RTX 3090 GPU, but when I launch the training I get a memory error. While searching for the cause, I noticed that if I store the error as a scalar instead of a tensor while computing it, I use much less memory. I then convert it back to a tensor just before calling backward(). The problem is that in this case the training starts, but nothing changes from one epoch to the next.
Here is the training loop:
And the epochs:
Is there a way to resolve this? Remember, my first problem is the CUDA memory error.
You are detaching the losses from the computation graph by wrapping them into a new tensor. Setting requires_grad = True on this new tensor will then hide the expected error message.
You can verify it by checking the .grad attributes of all parameters after the backward() op and would see that they are None.
To fix this, sum the losses directly without wrapping them into a new tensor.
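Since the original training loop is not shown, here is a minimal sketch of the difference, not the poster's actual code. The linear model, the random data, and the two loss formulas are hypothetical stand-ins; only the names err_fe and err_eq come from the thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)        # hypothetical stand-in for the real model
x = torch.randn(8, 3)          # made-up inputs
target = torch.randn(8, 1)     # made-up targets

# Two hypothetical loss terms, named after err_fe / err_eq in the thread.
pred = model(x)
err_fe = ((pred - target) ** 2).mean()
err_eq = pred.abs().mean()

# Broken: re-wrapping the values creates a new leaf tensor with no history,
# so backward() runs without error but never reaches the model parameters.
detached_loss = torch.tensor(err_fe.item() + err_eq.item(), requires_grad=True)
detached_loss.backward()
grads_after_detached = [p.grad for p in model.parameters()]  # every entry is None

# Fixed: sum the loss tensors directly so the graph stays attached.
loss = err_fe + err_eq
loss.backward()
grads_after_direct = [p.grad for p in model.parameters()]    # every entry is a tensor
```

With the re-wrapped loss the optimizer has nothing to apply, which would match the "no change from one epoch to the next" symptom.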
PS: you can post code snippets by wrapping them into three backticks ```, which would make debugging easier.
By calling .item() on err_fe and err_eq I saved a lot of memory and the CUDA memory error disappeared. Is there a way to keep using item()?
The code is really big (more than 400 lines just for the model) and it's not my code, so I can't publish it. I hope you can help me anyway; I can provide as many details as needed.
When I do that, the CUDA memory error appears again, before the training even starts.
It seems you might be running out of memory once the issue is fixed, i.e. once the tensor is no longer detached.
You also cannot use item(), as it would also detach the result from the computation graph and would return a plain Python scalar.
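item() is still safe for bookkeeping, as long as backward() is called on the tensor itself. The sketch below assumes a per-batch loop with a hypothetical linear model and random data; it calls backward() on the attached loss and only afterwards uses item() to accumulate a float for logging.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)  # hypothetical stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Made-up batches of (inputs, targets).
batches = [(torch.randn(8, 3), torch.randn(8, 1)) for _ in range(4)]

running_loss = 0.0
for x, target in batches:
    optimizer.zero_grad()
    loss = ((model(x) - target) ** 2).mean()  # tensor, attached to the graph
    loss.backward()                           # this batch's graph is freed here
    optimizer.step()
    running_loss += loss.item()               # plain float, keeps no graph alive

mean_loss = running_loss / len(batches)
```

Accumulating the float instead of the loss tensor avoids holding every batch's graph in memory until the end of the epoch. Whether that is the cause of the memory growth here cannot be confirmed without the original code.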
To lower the memory usage you could e.g. decrease the batch size etc.
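If a smaller batch changes the training behavior too much, gradient accumulation is one common way to keep the effective batch size while processing smaller chunks. This is a sketch under assumptions, not something confirmed to fit the model in this thread: the linear model, the data, and accum_steps are hypothetical, and batches of torch geometric graphs would need to be split with that library's own loaders rather than chunk().

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)  # hypothetical stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(16, 3)        # made-up full batch
target = torch.randn(16, 1)
accum_steps = 4               # hypothetical number of micro-batches

optimizer.zero_grad()
for x_chunk, t_chunk in zip(x.chunk(accum_steps), target.chunk(accum_steps)):
    # Scale each micro-batch loss so the summed gradients match the
    # gradient of the mean loss over the full batch.
    loss = ((model(x_chunk) - t_chunk) ** 2).mean() / accum_steps
    loss.backward()  # gradients add up in .grad; the chunk's graph is freed

accumulated = [p.grad.clone() for p in model.parameters()]
optimizer.step()  # one update for the whole effective batch
```

Only one micro-batch's activations are alive at a time, so peak memory follows the chunk size rather than the full batch size.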