Why does reusing training data cause non-releasable GPU memory to stack up?

Hi. I’ve used PyTorch 1.4 for my project.

I found a weird problem: reusing the training data causes "non-releasable memory" to stack up on the GPU.

Specifically, my program consists of several training processes.

Then, in every training process, a model and an optimizer are initialized, the training data is prepared, and the model is trained.

However, loading the training data with pickle takes too much time (each data instance is a class object consisting of lists; it is a natural-language dataset, e.g. SQuAD v1.1). So I try to load and 'cache' the training data once, before the training processes start, rather than loading it in every training process.
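The caching idea is roughly this (a minimal pure-Python sketch; the `Example` class, file name, and data are illustrative, not my actual code):

```python
import os
import pickle
import tempfile

class Example:
    # each instance holds lists, like the SQuAD features described above
    def __init__(self, tokens, label):
        self.tokens = tokens
        self.label = label

path = os.path.join(tempfile.gettempdir(), "train_cache_demo.pkl")
data = [Example(["a", "b"], i) for i in range(3)]

# pickling/unpickling a large dataset is slow, so do it once up front...
with open(path, "wb") as f:
    pickle.dump(data, f)

# ...and reuse the loaded objects across all training processes
with open(path, "rb") as f:
    cached = pickle.load(f)

print(len(cached), cached[0].tokens)
```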

The 'cached' training data is then fed to the "train" function, where it is wrapped in a TensorDataset.
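The wrapping step looks roughly like this (a sketch assuming PyTorch is installed; this `train` function, the feature dictionaries, and the tensor names are illustrative, not my actual code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train(cached_examples):
    # cached_examples is the pre-loaded data passed in from outside;
    # inside train() it is converted to tensors and wrapped once
    input_ids = torch.tensor([e["input_ids"] for e in cached_examples],
                             dtype=torch.long)
    labels = torch.tensor([e["label"] for e in cached_examples],
                          dtype=torch.long)
    dataset = TensorDataset(input_ids, labels)
    loader = DataLoader(dataset, batch_size=2)
    return sum(1 for _ in loader)  # count batches, just to show it iterates

cached_examples = [{"input_ids": [1, 2, 3], "label": 0} for _ in range(4)]
print(train(cached_examples))  # 4 examples / batch size 2 -> 2 batches
```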

However, by using the torch.cuda.memory_summary function, I found that this approach accumulates non-releasable GPU memory at every iteration, and it ends up with a CUDA out-of-memory error.
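For reference, I check the memory roughly like this (a sketch; the helper name `report_gpu_memory` is mine, and it is guarded so the code also runs on a machine without CUDA):

```python
import torch

def report_gpu_memory(tag=""):
    # torch.cuda.memory_summary is available from PyTorch 1.4 onward
    if torch.cuda.is_available():
        print(tag, torch.cuda.memory_summary(abbreviated=True))
    else:
        print(tag, "CUDA not available; nothing to report")

report_gpu_memory("after one training iteration")
```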

When I load the training data (or deep-copy the cached data) in every training process, this problem does not happen, but both approaches cost too much time.
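The deep-copy workaround looks like this (a pure-Python sketch; the data structure and the `run_training` function are illustrative):

```python
import copy

# load once (expensive), then copy per training run so each run gets a
# fresh object and no references are shared across iterations
cached_data = [{"tokens": ["hello", "world"], "label": 1} for _ in range(3)]

def run_training(data):
    data[0]["label"] = 99  # a training run may mutate its data in place
    return len(data)

for _ in range(2):
    fresh = copy.deepcopy(cached_data)  # per-iteration deep copy
    run_training(fresh)

print(cached_data[0]["label"])  # the cached original stays untouched -> 1
```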

Does anybody have some ideas about why this problem occurs?

Furthermore, how can I solve this problem?

Thank you for reading my question.

Could you post a minimal code snippet to show how your caching function looks and works, so that we can reproduce this issue?

Thank you for paying attention to my question.

Basically, our code is based on the following code:

Then, in our code, we call the main function of run_squad.py at every iteration, as follows:

from run_squad import main

for _ in range(100):
    main()  # one full training process per call

(This is not the exact code, since we slightly modified run_squad.py so that the main function can be called iteratively; however, I think this snippet conveys what we are trying to do.)

Since this code loads the training dataset at every iteration like this (line 791),

        train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)

it costs 15–30 seconds per iteration just to load the dataset.

Instead, we load the dataset once before the loop rather than at every iteration, as follows:

from run_squad import main, load_and_cache_examples

train_dataset = load_and_cache_examples(...)  # load once, before the loop
for _ in range(100):
    main(train_dataset)  # main modified to accept the cached dataset

However, this accumulates non-releasable memory on the GPU and finally fails with an out-of-memory error because of the stacked non-releasable memory.

This situation does not occur if we load the dataset inside every main function call.

I'm sorry for not providing more detailed code. If you need more information to inspect this problem, please let me know.