CPU RAM growing during each epoch

After training the first epoch, I call del and gc.collect() on the DataLoader, the model, the optimizer, everything except paths and the config, but 6 GB of RAM are still occupied. Then I load the model and create dev and test DataLoaders to evaluate it (7 GB after that), and when I delete the test and dev DataLoaders, RAM falls from 7 GB back to 6 GB. During the second epoch of training it grows again, and eventually RAM is full.
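Roughly, my per-epoch flow looks like this (heavily simplified; the helper names are just placeholders, not my real functions):

import gc

for epoch in range(num_epochs):
    # training
    train_loader, model, optimizer = build_training_objects(config)  # placeholder
    train_one_epoch(model, optimizer, train_loader)                  # placeholder
    save_model(model, model_name)                                    # see snippet below

    del train_loader, model, optimizer
    gc.collect()
    # RAM is still at ~6 GB here

    # evaluation
    model = load_model(model_name, config)                           # placeholder
    dev_loader, test_loader = build_eval_loaders(config)             # placeholder
    evaluate(model, dev_loader, test_loader)                         # ~7 GB here

    del model, dev_loader, test_loader
    gc.collect()
    # RAM drops back to ~6 GB, then keeps growing during the next epoch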

I can’t upload the code (it’s quite a large project anyway).

I’m sure no append() is used anywhere in my code (and even if it were, shouldn’t it be freed after gc.collect()?). I’m using .item() to get the loss value.
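Concretely, the per-batch loss bookkeeping looks roughly like this (simplified; the names are placeholders for my real objects):

running_loss = 0.0
for batch in train_loader:
    optimizer.zero_grad()
    loss = model.loss(batch)       # placeholder for my actual forward/loss call
    loss.backward()
    optimizer.step()
    running_loss += loss.item()    # .item() keeps only the Python float
    # NOT running_loss += loss, which would keep every iteration's graph alive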

Where should I look for the problem?

gc.collect() shouldn’t be necessary to avoid running out of memory.
Could you check your data loading functions and see whether unnecessary data samples are being stored somewhere?
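For example, a (made-up) Dataset like this would make CPU RAM grow, because every sample that was ever loaded stays alive in the cache:

import torch
from torch.utils.data import Dataset

class CachingDataset(Dataset):
    # hypothetical example of a Dataset that stores samples and leaks CPU RAM
    def __init__(self, paths):
        self.paths = paths
        self.cache = {}  # grows without bound

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        if idx not in self.cache:
            # every sample ever touched stays alive as long as the dataset does
            self.cache[idx] = torch.load(self.paths[idx])
        return self.cache[idx]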

Are you saving the model parameters somewhere, e.g. to restore the model for the best epoch?

I am saving the model to disk every epoch. I am using Python 2.7, so I’m doing it in the following way:

import io
import torch

# serialize the state_dict into an in-memory buffer, then write the bytes to disk
model_n_b = io.BytesIO()
torch.save(model.state_dict(), model_n_b)
with open(model_name, "wb") as f:
    f.write(model_n_b.getvalue())
model_n_b.close()

I’m not sure this is correct; however, even without saving (and without later deleting and re-loading the model object) the leak still occurs.

If I run only the data loading, like this:

for batch in loader:
    pass
    # loss = model.loss()
    ...

there’s no leak.

Thanks for the info.

Does the memory increase when you run the full training loop?
If so, I would try to debug where it occurs; maybe you can narrow it down to a specific line, e.g. the criterion or the backward pass.
Then maybe you could create a small executable code snippet and we could have a look.
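For instance, you could log the resident memory of the process around the suspected lines with psutil (just a rough sketch, adapt the calls to your actual loop):

import os
import psutil

proc = psutil.Process(os.getpid())

def log_rss(tag):
    # resident set size of the current process in MB
    print("{}: {:.1f} MB".format(tag, proc.memory_info().rss / 1024.0 / 1024.0))

for i, batch in enumerate(loader):
    log_rss("iter {} start".format(i))
    loss = model.loss(batch)      # or your forward pass + criterion
    log_rss("after forward")
    loss.backward()
    log_rss("after backward")
    optimizer.step()
    optimizer.zero_grad()
    log_rss("after optimizer step")

If the RSS jumps at the same point in every iteration, that line (or whatever it stores) is where to look.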