Loss function doesn't change from one epoch to another

el_youssfi_azeddine · July 29, 2021, 9:57am

Hello,
I am working on a large and complicated PyTorch model that solves PDEs using torch geometric graphs and so on. I run the computations on an RTX 3090 GPU, but when I launch the training I get a memory error. While investigating, I noticed that if I store the error as a scalar instead of a tensor I use much less memory, so I convert it back to a tensor just before calling backward(). The problem is that in this case the training starts, but the loss does not change from one epoch to another.
Here is the training loop:
and the epochs:
Is there a way to resolve this? Remember, my first problem is the CUDA memory error.
thanks

ptrblck · July 30, 2021, 4:26am

You are detaching the losses from the computation graph by wrapping them into a new tensor.
Setting requires_grad = True on this new tensor will then hide the expected error message.
You can verify this by checking the .grad attributes of all parameters after the backward() call; you will see that they are None.
To fix this, sum the losses directly without using torch.tensor(err...).
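
A minimal sketch of the difference, since the original training loop is not shown. The model, the data, and the `compute_errors` helper are hypothetical stand-ins; only the names `err_fe` and `err_eq` come from this thread.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for the real model and its two error terms.
model = torch.nn.Linear(3, 1)
x = torch.randn(8, 3)
target = torch.randn(8, 1)

def compute_errors():
    pred = model(x)
    err_fe = torch.nn.functional.mse_loss(pred, target)
    err_eq = pred.abs().mean()
    return err_fe, err_eq

# Detached: the loss is rebuilt from Python floats, so it is a new leaf
# tensor with no link to the model. backward() runs without an error,
# but no gradient ever reaches the parameters.
err_fe, err_eq = compute_errors()
loss = torch.tensor(err_fe.item() + err_eq.item(), requires_grad=True)
loss.backward()
grads_detached = [p.grad for p in model.parameters()]
print(grads_detached)  # every entry is None

# Attached: sum the loss tensors directly so the graph stays intact.
model.zero_grad()
err_fe, err_eq = compute_errors()
loss = err_fe + err_eq
loss.backward()
grads_attached = [p.grad for p in model.parameters()]
print([g is not None for g in grads_attached])
```

With the detached loss the optimizer has nothing to apply, which would match a loss that stays the same from one epoch to the next.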

PS: you can post code snippets by wrapping them into three backticks ```, which would make debugging easier.

el_youssfi_azeddine · July 30, 2021, 6:12am

By calling .item() on err_fe and err_eq I saved a lot of memory and the CUDA memory error disappeared. Is there a way to keep using item()?
The code is really big (more than 400 lines just for the model) and it is not my code, so I can't publish it. I hope you can help me anyway; I can provide as many details as needed.
thanks

el_youssfi_azeddine · July 30, 2021, 6:27am

When I do that, the CUDA memory error message appears before the training even starts.

ptrblck · July 30, 2021, 7:58am

It seems you might be running out of memory now that the issue is fixed, i.e. now that the tensor is no longer detached.
You also cannot use item(), as it would also detach the result from the computation graph and return a plain Python scalar.
To lower the memory usage you could, e.g., decrease the batch size.
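
One possible way to combine a smaller batch size with item() is sketched below. It is an assumption about what might fit your setup, not something taken from your code: the model, data, and `micro_batch_size` are hypothetical, and gradient accumulation is used here so that several small batches contribute to one optimizer step. item() appears only for logging, after backward() has already been called on the attached loss tensor.

```python
import torch

torch.manual_seed(0)

model = torch.nn.Linear(3, 1)               # hypothetical stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = torch.randn(32, 3)
target = torch.randn(32, 1)

micro_batch_size = 8                        # assumed value; smaller chunks mean a smaller graph in memory
num_chunks = data.shape[0] // micro_batch_size
before = [p.detach().clone() for p in model.parameters()]

optimizer.zero_grad()
running_loss = 0.0                          # plain Python float, used for logging only
for xb, yb in zip(data.split(micro_batch_size), target.split(micro_batch_size)):
    # Scale each chunk so the accumulated gradient matches the full-batch mean.
    loss = torch.nn.functional.mse_loss(model(xb), yb) / num_chunks
    loss.backward()                         # accumulates into .grad and frees this chunk's graph
    running_loss += loss.item()             # safe here: the float is never fed back into backward()
optimizer.step()

params_changed = any(
    not torch.equal(b, p.detach()) for b, p in zip(before, model.parameters())
)
print(running_loss, params_changed)
```

Only one chunk's graph is alive at a time, so peak memory depends on `micro_batch_size` rather than on the full batch.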