How to clear CPU memory after training (no CUDA)

I’ve seen several threads (here and elsewhere) discussing similar memory issues on GPUs, but none when running PyTorch on CPUs (no CUDA), so hopefully this isn’t too repetitive.

In a nutshell, I want to train several different models in order to compare their performance, but I cannot run more than 2-3 on my machine without the kernel crashing for lack of RAM (top shows available memory dropping from several GB to ~10 MB). Obviously, I could write the output to a file and then restart the kernel before starting the next model, but this is highly inelegant (no automation, repeated data pre-processing, etc.). Hence, what I’d like to do is clear/delete each model after training without killing the kernel, in order to make room for the next one.

For example, say I want to run five models with different numbers of layers and fixed input/output dimensions, using some pre-selected loss function (loss_func); the relevant code snippet looks like this:

# construct list of models and associated optimizers:
depth = np.arange(10, 20, 2)
models = []
opt = []
for i, d in enumerate(depth):
    models.append(build_network(d, input_dim, output_dim))    # function that returns an nn.Sequential
    opt.append(optim.SGD(models[i].parameters(), lr=learning_rate, momentum=momentum))    # torch.optim

# train each model, storing losses and accuracies:
loss = []
acc = []
for i in range(len(models)):
    loss.append([])
    acc.append([])
    fit(epochs, models[i], loss_func, opt[i], train_dl, valid_dl, loss[i], acc[i])    # train & evaluate each model

Training itself runs as expected (so any syntax errors above are merely typos, not present in the original code), but available RAM drops to nothing by the third loop iteration if I run for more than about 5 epochs. So what I want to do is free up the RAM by deleting each model (or the gradients, or whatever’s eating all that memory) before the next iteration. Scattered results across various forums suggested adding, directly below the call to fit() in the loop, either

models[i] = 0
opt[i] = 0
gc.collect()    # garbage collection

or

del models[i]
del opt[i]
gc.collect()

neither of which had any effect on available RAM (I would have expected it to jump back up to several GB between loops). Reference count in the latter case was also unchanged.
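
For reference, here is a quick way to check whether any tensors are actually still reachable after the deletion (a rough diagnostic sketch, not part of my actual training code; it just walks gc.get_objects() and counts torch.Tensor instances):

import gc
import torch

def count_live_tensors():
    # count every tensor the garbage collector can still reach,
    # plus the total number of elements they hold
    n, total_elems = 0, 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                n += 1
                total_elems += obj.numel()
        except Exception:
            pass
    return n, total_elems

print(count_live_tensors())    # before the cleanup: includes all model parameters
del models[i], opt[i]
gc.collect()
print(count_live_tensors())    # after the cleanup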

Is there a “proper” way to free up memory after each model is trained, without having to restart the kernel? (Again, I’m running on the CPU, but if there’s an elegant method that works for both CPU and GPU, that would be nice too.)

Try wrapping a single model run in a function, so that all local variables can be released, and call gc.collect() outside of it.

I’m not sure I understand your suggestion; the individual model runs are already wrapped in the fit() function. Do you mean the entire process of constructing the models/optimizers/etc should be wrapped in some function call? That would seem to obfuscate the code a bit, given the amount of prep before training.

Well, that worked for me in a similar situation. And it was easier than tracking non-collectable tensors.

I’m already wrapping the individual training calls in a function; could you post an example of what you did differently?

One other possibility here is that you have deleted all the tensors, but the heap size (i.e. the used memory as seen from outside the process) remained unchanged, because it is kept buffered for further allocations.

I considered this, but the kernel still crashes, so even if top is misleading me, it seems the memory is still not available.

I have lots of wrappers and don’t use global variables. Runner looks like:

import gc
import sys
import torch

class Runner:
    def inner(self, ...):
        # everything model-related is created and used inside inner(),
        # so it goes out of scope as soon as inner() returns
        datasource = create_datasource(...)
        trainable = create_trainable(datasource, ...)
        return trainable.run(...)

    def run(self):
        try:
            r = self.inner(...)
        except Exception as ex:
            import traceback
            try:
                # drop traceback frames that would otherwise keep tensors alive
                traceback.clear_frames(sys.last_traceback)
            except Exception:
                pass
            raise ex
        gc.collect()
        torch.cuda.empty_cache()
        self.logger.debug("cuda_allocated: %d Mb", torch.cuda.memory_allocated() // (1 << 20))
        return r

“trainable” encapsulates everything: torch module, optimizer, train loop (ignite).
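
Very roughly, create_trainable returns something like the following (a hypothetical sketch just to illustrate the idea; build_model and train_and_evaluate are placeholders, and the ignite details are omitted):

class Trainable:
    def __init__(self, datasource, config):
        # the module and optimizer exist only inside this object, so they
        # become collectable as soon as the Trainable itself is dropped
        self.model = build_model(config)    # placeholder
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr=config["lr"])
        self.datasource = datasource

    def run(self, epochs):
        # placeholder for the ignite training/evaluation loop;
        # it should return only plain Python numbers, not tensors
        return train_and_evaluate(self.model, self.optimizer, self.datasource, epochs)

def create_trainable(datasource, config):
    return Trainable(datasource, config)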

Thanks for the example, but I explicitly said that I’m running on CPU, not GPU, so I don’t see how the calls to torch.cuda will affect anything. Is there an equivalent call to clear the CPU cache (assuming, quite possibly incorrectly, that this is what I need)? Because otherwise, this seems to be qualitatively the same as my original scenario: again, training is already encapsulated in a function call, followed by gc.collect(). The memory just isn’t cleared afterwards.

Update: it turns out the heavy memory usage is due to the fact that I’m storing the outputs captured by forward hooks for each model (in order to run some analysis); obvious, in retrospect. So the question becomes, more specifically, how to properly delete/clear these hooks and their stored outputs once I no longer need them.

In my case, I’m simply storing these outputs as (arrays of) arrays (more efficient suggestions welcome!), but trying to free the memory with either hooks[i] = 0 or del hooks[i], followed by gc.collect(), as in the examples above, still fails to do the trick.
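
For concreteness, the setup looks roughly like this (a simplified sketch; the actual analysis code is omitted, and each model is the nn.Sequential returned by build_network above):

# one list of recorded layer outputs per model:
hooks = []
for i, m in enumerate(models):
    hooks.append([])
    for layer in m:
        layer.register_forward_hook(
            lambda module, inp, out, acts=hooks[i]: acts.append(out))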

I left the cuda lines in just to illustrate that I have no leftover tensors at that point, and that a re-run behaves as if the process had been restarted (except for explicitly stored stuff, of course).

Regarding hooks: be sure to .detach() the recorded tensors, and note that the functions that install the hooks return “handles” on which you can invoke .remove(). I don’t use hooks myself, though, so I’m not sure whether they have any additional inherent problems…
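
Something along these lines, for a single model (a sketch; it assumes model is one of the nn.Sequential networks from earlier):

import gc
import torch

stored = []     # recorded activations, detached so no autograd graph is kept alive
handles = []    # handles returned by register_forward_hook, needed for .remove()

def record(module, inp, out):
    stored.append(out.detach())    # .detach() drops the reference to the graph

for layer in model:
    handles.append(layer.register_forward_hook(record))

# ... forward passes / analysis ...

for h in handles:
    h.remove()       # uninstall the hooks from the model
handles.clear()
stored.clear()       # the recorded tensors are now unreferenced
del model
gc.collect()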