How to iteratively remove non-optimal directories during HPO to save space

Hello,

Could I ask,

When I run PyTorch Lightning with Ray Tune for HPO, it creates a lot of directories like this:

'train_fn_ff307114_1350_activation_function=nn_ReLU_inplace_True,c_hidden=64,dp_rate_linear=0.6747,learning_rate=0.0002,num_layers=_2022-12-21_17-51-28'
'train_fn_ff4d9f16_599_activation_function=nn_LeakyReLU_inplace_True,c_hidden=2056,dp_rate_linear=0.5871,learning_rate=0.0035,num_l_2022-12-21_13-26-00'
'train_fn_ffa6c2a4_892_activation_function=nn_ReLU_inplace_True,c_hidden=512,dp_rate_linear=0.6173,learning_rate=0.0116,num_layers=_2022-12-21_15-06-17'
'train_fn_ffe7cd72_1684_activation_function=nn_Tanh,c_hidden=2056,dp_rate_linear=0.6520,learning_rate=0.0032,num_layers=4,optimizer_2022-12-21_19-45-25'

Each directory contains a checkpoint, logs, etc.

Is there a way to save space here? For example, once a trial turns out to have a better val_acc (my metric of interest for the HPO), the attempts with a worse val_acc no longer need to be kept, so can I delete their directories as I go through the HPO process? (A rough sketch of what I mean is below, after my code.)

My code for running HPO is:


import os

from ray import tune
from ray.air import RunConfig
from ray.tune.search.hyperopt import HyperOptSearch


def run_ray(metric='val_acc', mode='max', num_samples=4000, config_dict={},
            checkpoint_file_name="full_run_2_hemolysis_ray_ckpt",
            config_file='full_run_2_hemolysis/hemolysis/best_config.txt',
            local_dir='full_run_2_hemolysis/hemolysis/runs/'):

    hyperopt_search = HyperOptSearch(metric=metric, mode=mode)

    # train_fn is my Lightning training function; change from gpu {"gpu": 1} if needed
    tuner = tune.Tuner(
        tune.with_resources(train_fn, {"gpu": 1}),
        tune_config=tune.TuneConfig(num_samples=num_samples, search_alg=hyperopt_search),
        param_space=config_dict,
        run_config=RunConfig(local_dir=local_dir),
    )
    results = tuner.fit()
    best_result = results.get_best_result(metric=metric, mode=mode)

    with open(config_file, 'a') as f:
        # Record the best hyperparameter config.
        f.write(str(best_result.config) + '\n')

        # Materialise the best checkpoint and load the trained model from it.
        best_checkpoint = best_result.checkpoint
        path = os.path.join(str(best_checkpoint.to_directory()), checkpoint_file_name)
        print(path)
        model = GraphLevelGNN.load_from_checkpoint(path)  # GraphLevelGNN is my LightningModule

        f.write(str(best_result.log_dir))

    return best_result.log_dir, model
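
To make the question concrete, this is roughly what I have in mind: a custom tune.Callback that deletes a trial's directory as soon as a better-scoring trial exists. It is just an untested sketch of my own (the class name and the deletion logic are not an existing Tune feature as far as I know), and I don't know whether removing directories behind Tune's back breaks checkpointing, restoration, or the final results analysis; that's partly what I'm asking.

import shutil

from ray.tune import Callback


class KeepBestTrialDirCallback(Callback):
    """Sketch: delete the log dir of any finished trial that isn't the best so far."""

    def __init__(self, metric="val_acc", mode="max"):
        self._metric = metric
        self._sign = 1 if mode == "max" else -1
        self._best_value = None
        self._best_logdir = None

    def on_trial_complete(self, iteration, trials, trial, **info):
        value = trial.last_result.get(self._metric)
        if value is None:
            return
        if (self._best_value is None
                or self._sign * value > self._sign * self._best_value):
            # New best: drop the previous best's directory, keep this one.
            if self._best_logdir is not None:
                shutil.rmtree(self._best_logdir, ignore_errors=True)
            self._best_value = value
            self._best_logdir = trial.logdir
        else:
            # Worse than the current best: delete this trial's directory.
            shutil.rmtree(trial.logdir, ignore_errors=True)

I would then pass it in via run_config=RunConfig(local_dir=local_dir, callbacks=[KeepBestTrialDirCallback(metric, mode)]). Is something like this safe, or is there a built-in way to do it?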

I’m not sure if I’ve given enough information; I just didn’t want to overload the post with irrelevant details. This may also be more of a Ray question than a Lightning one, so let me know if it’s not appropriate here.