How to ensure GPU memory is freed at the end of a program?


I’m using a bash script to do a grid search for my PyTorch code.
Something like the following:

for lr in 0.1 0.01 0.001; do
    for wd in 0.1 0.01 0.001; do
        # call the PyTorch script for the current hyperparameter setting
        # (train.py is a placeholder for the actual script name)
        python train.py --lr=$lr --wd=$wd
    done
done

I noticed that, sometimes, a process finishes without freeing the GPU memory. This eventually causes the GPU to run out of memory (OOM), even though I execute the programs sequentially.

How can I fix this?

Maybe some GPU cache memory is not emptied? Check torch.cuda.empty_cache().
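A minimal sketch of how that could look at the end of a run (assuming PyTorch; `model` and `optimizer` are hypothetical names for whatever objects hold your GPU tensors):

```python
import torch

def free_gpu_memory(model, optimizer):
    # Drop the Python references so the underlying tensors become collectable.
    del model, optimizer
    # empty_cache() returns cached-but-unused blocks to the driver so other
    # processes (or the next run) can use them. It does NOT free tensors
    # that are still referenced somewhere.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```

Note that when a process exits cleanly, the driver reclaims all of its GPU memory anyway, so memory that stays occupied after a run usually points at a process that never actually exited.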

Was each script execution exited correctly, or did you see any errors?
If the GPU memory is not released after the run finishes, some zombie processes might still be alive.
Could you check via htop or ps?
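For reference, one way to check (assuming nvidia-smi is available; `<pid>` is a placeholder for an actual process ID):

```shell
# show process state; a STAT of "Z" marks a zombie (defunct) process
ps -eo pid,ppid,stat,cmd | grep -i python

# list processes currently holding GPU memory (requires nvidia-smi)
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# kill a leftover process by its PID
kill -9 <pid>
```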

Hi @ptrblck

Indeed, there seem to be zombie processes. Btw, the scripts called earlier run fine; however, the later ones fail with an OOM error because the GPU memory is already exhausted.