Error Handling in a Runs Loop

Hello all,

I was curious: for a hyperparameter search that sometimes runs into issues like CUDA out of memory, depending on batch_size, image size, and related parameters, is this code structure an okay practice? If there is a better way of handling this, I would love to hear your comments!

import gc
import torch

for run_count, hyperparams in enumerate(HYPERPARAMS, start=1):

    # Define model, dataloaders, optimizer, scheduler, etc. for this run
    try:
        # Call training utils
        ...

    except RuntimeError as e:
        # Log the error (e.g. CUDA out of memory) and move on to the next run
        with open("run_parameters.txt", "a+") as text_file:
            text_file.write("*** Runtime Error: {} \n\n".format(e))

    finally:
        # Reset for the next run
        del model
        del optimizer_ft
        del exp_lr_scheduler
        del dataloaders
        gc.collect()
        torch.cuda.empty_cache()

Hi,

If you’re designing this from scratch, I would advise creating a new process to run each inner job, to make sure no leftover state survives between runs.
You can check this issue for discussion of GPU OOM failures: https://github.com/pytorch/pytorch/issues/18853.
Also be aware that if an assert fires in a CUDA kernel (like an index out of range), the CUDA driver cannot recover from it and you will have to restart the process. So your code example won’t help in that case, as all of the subsequent trainings will fail.
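For reference, here is a minimal sketch of the process-per-run idea, not an official recipe: train_one_run is a hypothetical placeholder for your training utils, and HYPERPARAMS and run_parameters.txt are taken from your snippet. Each configuration runs in its own spawned process, so the OS reclaims all of that process's GPU memory when it exits, even if its CUDA context became unusable.

import multiprocessing as mp

HYPERPARAMS = []  # your list of hyperparameter settings, as in the question


def train_one_run(hyperparams):
    # Placeholder: build the model, dataloaders, optimizer and scheduler here,
    # then run training for this configuration.
    ...


def worker(hyperparams, queue):
    # Runs one configuration inside an isolated process and reports the result.
    try:
        train_one_run(hyperparams)
        queue.put(("ok", None))
    except RuntimeError as e:  # e.g. CUDA out of memory
        queue.put(("error", str(e)))


if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # each child initializes CUDA from scratch
    for run_count, hyperparams in enumerate(HYPERPARAMS, start=1):
        queue = ctx.Queue()
        p = ctx.Process(target=worker, args=(hyperparams, queue))
        p.start()
        p.join()
        if not queue.empty():
            status, msg = queue.get()
        else:
            status, msg = "crashed", None  # child died before reporting
        with open("run_parameters.txt", "a+") as text_file:
            text_file.write("Run {}: {} {}\n".format(run_count, status, msg or ""))

The spawn start method is used so that every child process gets a fresh CUDA context instead of inheriting state from the parent.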


Thanks! Makes sense. I appreciate the advice.

Hey, did you find a fix for this? I’m a beginner in PyTorch, could you please explain?