I’ve seen several threads (here and elsewhere) discussing similar memory issues on GPUs, but none about running PyTorch on the CPU (no CUDA), so hopefully this isn’t too repetitive.
In a nutshell, I want to train several different models in order to compare their performance, but I cannot run more than 2-3 on my machine without the kernel crashing for lack of RAM (top shows free memory dropping from several GB to ~10 MB). Obviously, I could write the output to a file and restart the kernel before starting the next model, but that is highly inelegant (no automation, repeated data pre-processing, etc.). What I’d like instead is to clear/delete each model after training without killing the kernel, in order to make room for the next one.
For example, say I want to run five models with different numbers of layers and fixed input/output dimensions, using some pre-selected loss function (loss_func); the relevant code snippet looks like this:
# construct list of models and associated optimizers:
import numpy as np
from torch import optim

depth = np.arange(10, 20, 2)
models = []
opt = []
for i, d in enumerate(depth):
    models.append(build_network(d, input_dim, output_dim))  # user-defined; returns nn.Sequential
    opt.append(optim.SGD(models[i].parameters(), lr=learning_rate, momentum=momentum))
# train each model, storing losses and accuracies:
loss = []
acc = []
for i in range(len(models)):
    loss.append([])
    acc.append([])
    fit(epochs, models[i], loss_func, opt[i], train_dl, valid_dl, loss[i], acc[i])  # train & evaluate
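(For context, fit() is a standard train-then-validate loop; a stripped-down sketch, not the exact code:)

import torch

def fit(epochs, model, loss_func, opt, train_dl, valid_dl, losses, accs):
    for epoch in range(epochs):
        model.train()
        for xb, yb in train_dl:
            loss_func(model(xb), yb).backward()
            opt.step()
            opt.zero_grad()
        model.eval()
        total, correct, val_loss = 0, 0, 0.0
        with torch.no_grad():  # no autograd graph is built during evaluation
            for xb, yb in valid_dl:
                out = model(xb)
                val_loss += loss_func(out, yb).item() * len(xb)
                correct += (out.argmax(dim=1) == yb).sum().item()
                total += len(xb)
        losses.append(val_loss / total)  # .item() gives plain floats, so no graph is retained here
        accs.append(correct / total)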
Training itself runs as expected, but available RAM drops to nothing by the third iteration if I run for more than about 5 epochs. So what I want to do is free up the RAM by deleting each model (or the gradients, or whatever’s eating all that memory) before the next iteration. Scattered results across various forums suggested adding, directly below the call to fit() in the loop,
models[i] = 0
opt[i] = 0
gc.collect() # garbage collection
or
del models[i]
del opt[i]
gc.collect()
Neither had any effect on available RAM (I would have expected it to jump back up to several GB between iterations), and in the latter case the reference count was unchanged as well.
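For what it’s worth, the structure I’d like to end up with is roughly the following: build, train, and discard one model per iteration, so that at most one network is alive at a time (a sketch, using the same names as above, and assuming fit() appends only plain Python floats; appending the raw loss tensors instead would keep each model’s entire autograd graph alive, which would explain the symptoms):

import gc
from torch import optim

loss, acc = [], []
for d in depth:
    model = build_network(d, input_dim, output_dim)
    opt = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)
    loss.append([])
    acc.append([])
    fit(epochs, model, loss_func, opt, train_dl, valid_dl, loss[-1], acc[-1])
    del model, opt  # drop the only references to the network, its gradients, and optimizer state
    gc.collect()    # force a collection pass before building the next model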
Is there a “proper” way to free up memory after each model is trained, without having to restart the kernel? (Again, I’m running on CPU, but if there’s an elegant method that works for both CPU and GPU, that would be nice too.)
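(For reference, the device-agnostic cleanup I’d imagine looks something like this; torch.cuda.empty_cache() is only relevant with CUDA, since it returns blocks held by the caching allocator to the driver:)

import gc
import torch

# after training model i:
models[i] = None              # drop our reference to the network...
opt[i] = None                 # ...and to the optimizer (its state holds tensors too)
gc.collect()                  # reclaim CPU-side memory
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # additionally release cached GPU blocks back to the driver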