Memory Leaks while training a model

jerinphilip · November 16, 2018, 6:55am

Can fellow users share their methods of identifying and removing memory leaks? I’m not returning anything other than a scalar loss from my model/its trainer, hoping that scoping will be enough to free the unused variables, but eventually the entire setup runs out of memory. This is a RAM OOM rather than a GPU OOM. I’m on torch 0.4.1.

ptrblck · November 16, 2018, 11:20am

Could you post a small code example to reproduce the error?
How are you returning the loss?

jerinphilip · December 4, 2018, 1:37pm

Hey, the codebase was too messy to isolate a small sample from, but the following change seems to have done the trick.

         # Multiply with logprobs
         generator_objective = (advantages * log_probs).sum(dim=0)
-        return (generator_objective, cumulative_rewards)
+        return (generator_objective, cumulative_rewards.clone())

I don’t have a requirement to call .backward() on the cumulative_rewards, and that seems to be creating an issue. Can you tell what’s happening under the hood?

ptrblck · December 4, 2018, 1:42pm

I’m not sure how cumulative_rewards is being calculated, but it might be that the computation graph is still attached to it. If you store this tensor somehow or otherwise keep it alive, all the computation graphs will be stored as well.
You could call detach() on it to detach it from the computation graph so that it can be freed.
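
As a rough illustration of the point above, here is a minimal sketch. The function training_step and the way cumulative_rewards is computed are hypothetical stand-ins, since the original trainer was not posted; only the detach(), clone() and grad_fn behaviour comes from PyTorch itself. A tensor that still has a grad_fn keeps a reference to the graph that produced it, so collecting such tensors across iterations keeps every iteration's graph alive.

```python
import torch


def training_step(detach_rewards: bool):
    """Hypothetical stand-in for the trainer: returns (loss, cumulative_rewards)."""
    weights = torch.ones(4, requires_grad=True)  # stands in for model parameters
    log_probs = torch.log_softmax(weights, dim=0)
    # Derived from tensors that require grad, so it carries a grad_fn.
    cumulative_rewards = log_probs.exp().cumsum(dim=0)
    advantages = torch.ones(4)  # assumed constant, for illustration only
    generator_objective = (advantages * log_probs).sum(dim=0)
    if detach_rewards:
        # Same values, but no longer linked to the computation graph.
        cumulative_rewards = cumulative_rewards.detach()
    return generator_objective, cumulative_rewards


# Simulate a caller that keeps the rewards from every iteration, e.g. for logging.
attached_history = [training_step(False)[1] for _ in range(3)]
detached_history = [training_step(True)[1] for _ in range(3)]

# Every stored tensor with a grad_fn keeps its whole graph reachable.
attached_has_graph = all(t.grad_fn is not None for t in attached_history)
detached_has_graph = any(t.grad_fn is not None for t in detached_history)

# clone() is itself a differentiable op, so a plain clone still has a grad_fn.
cloned = attached_history[0].clone()
clone_has_graph = cloned.grad_fn is not None

print(attached_has_graph, detached_has_graph, clone_has_graph)
```

If only the numeric value is needed (for logging, say), storing cumulative_rewards.detach() or cumulative_rewards.item() for scalars avoids holding on to the graph.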