CUDA out of memory error when calling the train script in a loop from another python script

Hi all,

I’m trying to run the same PyTorch training script with different arguments (via argparse) from another Python script. I’m using os.system() to do this.

Here’s what I’m trying to do:
train.py => the script that contains the training loop.
runner.py => the script that runs the train script in a loop.

# runner.py
import os

for hp in hyperparams:
    os.system(f"CUDA_VISIBLE_DEVICES=1 python train.py --arg1 {hp}")
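As a side note, os.system() doesn’t surface failures from the child process, which makes OOM crashes harder to spot. Here is a sketch of the same loop using subprocess.run instead; the hyperparam values, train.py, and the --arg1 flag are placeholders matching the post, not a tested setup:

```python
import os
import subprocess
import sys

hyperparams = [0.1, 0.01, 0.001]  # placeholder values for illustration

for hp in hyperparams:
    # Passing the command as a list avoids shell quoting issues and ensures
    # the current value of hp is actually forwarded to train.py.
    cmd = [sys.executable, "train.py", "--arg1", str(hp)]
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": "1"}
    print("running:", " ".join(cmd))
    if os.path.exists("train.py"):
        # check=True raises CalledProcessError if a run exits non-zero,
        # so an OOM crash in one run stops the loop instead of passing silently.
        subprocess.run(cmd, env=env, check=True)
```

Since each run is a separate process, its GPU memory should be released by the driver when the process exits; checking exit codes this way at least tells you which run failed and how.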

A few models get trained, but I eventually end up with a CUDA out-of-memory error. My guess is that the GPU memory is not being freed after each iteration. What can I do to mitigate this?

Does it train completely for the first set of hyperparams in the for loop?

Yep. For instance, if there are 10 models, it successfully trains the first 8 and then throws a CUDA out-of-memory error for models 9 and 10.