Thank you for your response.
Yes, I understand that clearing the cache after restarting is not sensible, since the memory should ideally already be deallocated.
But if my model was able to train with a certain batch size for the past 'n' attempts, why does it stop doing so on my 'n+1'th attempt? I do not see how reducing the batch size would solve this problem.
As I said, this happens very randomly.
Although my program is obviously able to detect the GPU (as it says "CUDA out of memory"), I still wanted to check it programmatically. So I inserted print(torch.cuda.is_available()) right before training began. And voila! It worked. No more CUDA out of memory errors, at least for the time being. There is no logic behind why it started working now.
Other programs using the same GPU can also cause this: if you are sharing the machine, if a screen is connected to it, or even if another PyTorch script uses the GPU by mistake.
I am accessing the machine through a Remote Desktop connection.
And apart from the main program, there are no Python scripts running in the background. There is certainly other software open, such as VSCode and multiple Google Chrome tabs.
But this was the case even when the main program was running smoothly (i.e. without OOM error).
Hello, I have the same problem. I run torch.cuda.empty_cache() after the last group of images finishes training, then I start training a new group without restarting the kernel, but the GPU memory used still keeps getting bigger and bigger.
I described the problem in this topic; I wonder if you have any good suggestions, thanks! https://discuss.pytorch.org/c/memory-format/23
On Linux, sometimes an old process keeps hold of the GPU. You can check for such processes by running nvidia-smi in the terminal. Note the PID of any process utilizing the GPU and kill it with sudo kill <enter PID here>.
I was about to ask a question but I found my issue. Maybe it will help others.
I was on Google Colab and found that I could train my model several times, but on the 3rd or 4th time I'd run into the memory error. Using torch.cuda.empty_cache() between runs did not help. All I could do was restart my kernel.
I had a setup of the sort:

class Fitter:
    def __init__(self, model):
        self.model = model
        self.optimizer = ...  # init optimizer here
The point is that I was carrying the model over in between runs but making a new optimizer (in my case I was making new instances of Fitter). And in my case, the (Adam) optimizer state actually took up more memory than my model!
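That the Adam state can outweigh the model is easy to check. Here is a small sketch (the layer size is arbitrary, just for illustration): Adam lazily allocates an exp_avg and an exp_avg_sq tensor per parameter on its first step, so its state holds roughly twice as many elements as the model itself.

```python
import torch

model = torch.nn.Linear(1000, 1000)
optimizer = torch.optim.Adam(model.parameters())

# The state is allocated lazily, on the first optimizer step
model(torch.randn(4, 1000)).sum().backward()
optimizer.step()

param_numel = sum(p.numel() for p in model.parameters())
state_numel = sum(
    t.numel()
    for s in optimizer.state.values()
    for t in s.values()
    if isinstance(t, torch.Tensor)
)
print(param_numel, state_numel)  # the state is at least 2x the parameters
```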
So to fix it I tried some things.
This did not work:

def wipe_memory(self):  # DOES NOT WORK
    self.optimizer = None
    torch.cuda.empty_cache()
Neither did this:

def wipe_memory(self):  # DOES NOT WORK
    del self.optimizer
    self.optimizer = None
    gc.collect()
    torch.cuda.empty_cache()
This did work!

def wipe_memory(self):  # DOES WORK
    self._optimizer_to(torch.device('cpu'))
    del self.optimizer
    gc.collect()
    torch.cuda.empty_cache()

def _optimizer_to(self, device):
    for param in self.optimizer.state.values():
        # Not sure there are any global tensors in the state dict
        if isinstance(param, torch.Tensor):
            param.data = param.data.to(device)
            if param._grad is not None:
                param._grad.data = param._grad.data.to(device)
        elif isinstance(param, dict):
            for subparam in param.values():
                if isinstance(subparam, torch.Tensor):
                    subparam.data = subparam.data.to(device)
                    if subparam._grad is not None:
                        subparam._grad.data = subparam._grad.data.to(device)
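For reference, the same idea can be written as a standalone helper outside the Fitter class. This is a sketch that runs entirely on CPU (so no GPU is needed to try it); optimizer_to is a made-up name, not a PyTorch API:

```python
import gc

import torch

def optimizer_to(optimizer, device):
    # Move every tensor held in the optimizer state to `device`.
    for state in optimizer.state.values():
        for key, value in state.items():
            if isinstance(value, torch.Tensor):
                state[key] = value.to(device)

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters())

# One step so Adam actually allocates its state
model(torch.randn(2, 8)).sum().backward()
optimizer.step()

optimizer_to(optimizer, torch.device('cpu'))
all_on_cpu = all(
    v.device.type == 'cpu'
    for s in optimizer.state.values()
    for v in s.values()
    if isinstance(v, torch.Tensor)
)

# Now the optimizer can be dropped and the cached memory released
del optimizer
gc.collect()
torch.cuda.empty_cache()  # no-op when CUDA was never initialized
```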
Is this mandatory in the training session? For instance, I've built this function to do my training:
def train(...):
    # Model: train
    model.train()
    # Load the data and convert to device
    for (data, label) in loader:
        ...
        # Refresh the gradients
        optimizer.zero_grad(set_to_none=True)
        # Calculate loss
        loss = model.objective(x)
        # Backprop
        loss.backward()
        # Optimizer step
        optimizer.step()
Should I keep it as it is, or am I supposed to call .item() on the loss to be able to free some space on my GPU?
No, this function looks good.
You should use .item() if you want to store the value of your loss in a list for further plotting/tracking (basically anything that would make it outlive the inner loop). Otherwise, you don't need to worry about this.
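As a small sketch of that pattern (a hypothetical toy model, just to show the types involved): appending loss.item() stores plain floats, so the list does not keep the autograd graph alive across iterations.

```python
import torch

model = torch.nn.Linear(4, 1)
x = torch.randn(8, 4)

losses = []
for _ in range(3):
    loss = model(x).pow(2).mean()
    # .item() returns a plain Python float detached from the autograd
    # graph, so appending it does not keep activations alive.
    losses.append(loss.item())
```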
One last thing I wonder: would it cause any problem, or would it contribute to freeing up cached memory, if I do something like this:
# Training arrangements
...
# Backprop
loss.backward()
# Optimizer step
optimizer.step()
# Then, delete loss object
del loss
# and free cache
torch.cuda.empty_cache()
This will slow down your training (empty_cache is an expensive call), but otherwise, in 99.9% of cases it won't do anything else.
Emptying the cache is already done automatically if you're about to run out of memory, so there is no reason to do it by hand unless you have multiple processes using the same GPU and you want this process to free up space for the other process to use. Which is a very, very unusual thing to do.
Based on your description it seems you are storing (unwanted) data and are thus increasing the memory usage until you eventually run into an OOM error.
Freeing the cache will not avoid these errors; besides slowing down your code, it will only allow other processes to use the GPU memory.
@albanD can you clarify what this means? If loss.item() does not deallocate the GPU memory, does it keep the related graph in GPU memory and then return a float?
Is PyTorch keeping track of activations in the graph only in the model, or also whenever the loss variable is in scope or referenced somewhere, e.g. in an object?
Yeah, it is a bit overloaded. If you do value = loss.item(), it does not change anything about the loss tensor at that line. So in that sense it will not "deallocate" the loss.
But it will still behave differently than tensor = loss.clone(), for example. clone will keep building the autograd graph, and so tensor will potentially keep quite a bit of stuff alive through it. This does not happen with .item(), as it returns a plain number and thus does not extend the autograd graph.
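A minimal illustration of the difference (a toy scalar loss, just to inspect the types):

```python
import torch

w = torch.randn(3, requires_grad=True)
loss = (w * w).sum()

cloned = loss.clone()  # still a tensor attached to the autograd graph
value = loss.item()    # a plain Python float, no graph attached

print(cloned.grad_fn is not None)  # clone records a backward node
print(type(value))
```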