Why does reusing training data cause "non-releasable memory" to accumulate on the GPU?

Hi. I’ve used PyTorch 1.4 for my project.

I ran into a weird problem: reusing the training data causes “Non-releasable Memory” to pile up on the GPU.

Specifically, my program consists of several training processes that run one after another.

In each training process, a model, an optimizer, and the training data are initialized, and the model is trained.

However, loading the training data with pickle takes too much time (each data instance is a class object made up of lists; it is a natural language dataset, e.g. SQuAD v1.1). So instead of loading the training data in every training process, I try to load and ‘cache’ it once before the training processes start.

The ‘cached’ training data is then passed to the “train” function, where it is wrapped in a TensorDataset.
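
For context, the caching flow looks roughly like this. This is only a sketch: the feature fields (input_ids, attention_mask, label), the batch size, and the body of train are illustrative placeholders, not my exact code.

import pickle
import torch
from torch.utils.data import TensorDataset, DataLoader

def load_cached_features(path):
    # Load the pickled feature objects once, up front (this is the slow part).
    with open(path, "rb") as f:
        return pickle.load(f)

def train(cached_features, model, optimizer, device):
    # Inside the train function, the cached feature objects are converted to
    # tensors and wrapped in a TensorDataset, then iterated with a DataLoader.
    input_ids = torch.tensor([f.input_ids for f in cached_features], dtype=torch.long)
    masks = torch.tensor([f.attention_mask for f in cached_features], dtype=torch.long)
    labels = torch.tensor([f.label for f in cached_features], dtype=torch.long)
    dataset = TensorDataset(input_ids, masks, labels)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    for batch in loader:
        batch = tuple(t.to(device) for t in batch)
        # forward pass, loss.backward(), optimizer.step() would go here

The cached feature list is loaded once before the training processes start, and the same Python object is passed to train in every process.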

However, using the torch.cuda.memory_summary function, I found that this approach makes the non-releasable memory on the GPU grow at every iteration, and it eventually ends in a CUDA out-of-memory error.
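
This is roughly how I inspect the memory after each training process (a minimal sketch of the inspection code only; the training call itself is omitted):

import torch

def report_gpu_memory(tag):
    # The "Non-releasable memory" rows of torch.cuda.memory_summary() are the
    # ones that keep growing from one training process to the next in my case.
    print("===", tag, "===")
    print(torch.cuda.memory_summary(abbreviated=True))
    print("allocated:", torch.cuda.memory_allocated(),
          "reserved:", torch.cuda.memory_reserved())

# e.g. call report_gpu_memory("after run 3") right after each training process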

When I load the training data from disk or deep-copy the cached training data in every training process, this problem does not happen, but both options take too much time.
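
For reference, the deep-copy workaround looks roughly like this (a sketch; num_training_processes, cached_train_data, and train stand in for my actual code):

import copy

for run in range(num_training_processes):
    # Giving each training process its own copy of the cached data avoids the
    # memory build-up, but copy.deepcopy on a dataset of this size is too slow.
    data_for_this_run = copy.deepcopy(cached_train_data)
    train(data_for_this_run)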

Does anybody have an idea why this problem occurs?

Furthermore, how can I solve this problem?

Thank you for reading my question.

Could you post a minimal code snippet to show how your caching function looks and works, so that we can reproduce this issue?

Thank you for looking into my question.

Basically, our code is based on the run_squad.py example script from Hugging Face Transformers.

Then, in our code, we call the main function of run_squad.py at every iteration, as follows:

from run_squad import main
...
for _ in range(100):
    main()

(This is not the exact code, since we slightly modified run_squad.py so that main can be called iteratively, but I think this snippet shows what we are trying to do.)

Since this code loads the training dataset at every iteration, like this (line 791 of run_squad.py),

        train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)

it costs 15~30 seconds per iteration just to load the dataset.

Instead, we load the dataset once before the loop, rather than at every iteration, as follows:

from run_squad import main, load_and_cache_examples
...
train_dataset = load_and_cache_examples(...)
for _ in range(100):
    main(train_dataset)
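
Concretely, our change to main is roughly the following (illustrative only, not the exact diff; in the unmodified run_squad.py, main takes no arguments and always calls load_and_cache_examples itself):

# sketch of the modified main() in run_squad.py
def main(cached_train_dataset=None):
    ...
    if cached_train_dataset is not None:
        # reuse the TensorDataset that was built once outside the loop
        train_dataset = cached_train_dataset
    else:
        # original behaviour: (re)load the features on every call
        train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
    ...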

However, this ends up with “Non-releasable Memory” piling up on the GPU, and eventually an out-of-memory error occurs because of it.

This does not happen if we load the dataset inside every call to main.

I’m sorry for not providing the detailed code. If you need more information to investigate this problem, please let me know.

Thanks.