I am trying to use the Hydra Optuna sweeper with my PyTorch DDP (mp.spawn) setup. Plain PyTorch + DDP training runs fine. However, when I try to run the Hydra sweeper, it doesn't work: it stops after a single trial instead of running the whole optimization loop. How can I make the sweeper run all trials and tune the hyperparameters?
In the code below, the main function is launched by manual_proc_runner, which is spawned once per GPU.
import hydra
import torch.multiprocessing as mp

def manual_proc_runner(rank, cfg, m, results):  # runs in each spawned process
    world_size = m.get_world_size()  # 2
    print('world_size:', world_size)
    manual_proc_setup(rank=rank, world_size=world_size)

    # ----- DDP Manual Env Setup -----
    set_seed(seed=42)
    env = CustomManualEnvironment(world_size=world_size, rank=rank)

    # main(m=m, env=env, args=args)
    loss = main(cfg=cfg, m=m, env=env)
    results[rank] = loss
    print(f'results:{results} in rank:{rank}')

    # clean up DDP process group after completion
    manual_proc_cleanup()
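(manual_proc_setup and manual_proc_cleanup are the usual DDP boilerplate, roughly the following, simplified; the address/port values are placeholders:)

import os
import torch.distributed as dist

def manual_proc_setup(rank, world_size):
    # single-node rendezvous; MASTER_ADDR/MASTER_PORT are placeholders
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

def manual_proc_cleanup():
    # tear down the process group started above
    dist.destroy_process_group()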
@hydra.main(version_base=None, config_path='../config_files', config_name='config')
def hydra_manual_proc_launcher(cfg):  # no arguments other than 'cfg', since Hydra passes only the config
    m = ModelConfig('./config_files/drivaer_domains_ddp.ini')
    world_size = m.get_world_size()  # 2
    print('world_size:', world_size)

    out_dir = hydra.core.hydra_config.HydraConfig.get().runtime.output_dir
    print(f'out_dir:{out_dir}')

    with mp.Manager() as manager:
        results = manager.list([None] * world_size)
        mp.spawn(manual_proc_runner, args=(cfg, m, results), nprocs=world_size, join=True)
        print(f'results:{results}')
        # average the per-rank losses, skipping any rank that did not report
        reported = [r for r in results if r is not None]
        aggregated_loss = sum(reported) / len(reported)
        print(f'Aggregated loss: {aggregated_loss}')
        # the Optuna sweeper reads this return value as the trial's objective
        return aggregated_loss
def main(cfg, m, env):
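    # body sketch, simplified -- the swept hyperparameters arrive on cfg
    mp_iterations = cfg.mp_iterations
    mlp_dim = cfg.mlp_dim
    mlp_layers = cfg.mlp_layers
    # ... build the model from these, run the DDP training loop ...
    return val_loss  # per-rank loss; val_loss is a placeholder name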
This is what I have, along with the following config file. When I run it, the default hyperparameters are picked up, a single run executes, and then it stops instead of running the whole optimization loop.
defaults:
  - override hydra/sweeper: optuna
  - override hydra/sweeper/sampler: tpe

storage: 'sqlite:///../output/train/checkpoints/subdomains_211_v2/optuna_dashboard.db'

hydra:
  sweeper:
    sampler:
      seed: 123
    direction: minimize
    study_name: hydra-1
    # storage: null
    n_trials: 4  # 50
    n_jobs: 1
    params:
      mp_iterations: range(1, 15)
      mlp_dim: range(64, 256)
      mlp_layers: range(1, 4)

# default values (overridden by the sweeper during a sweep)
mp_iterations: 15
mlp_dim: 64
mlp_layers: 2

# if true, simulate a failure by raising an exception
error: false
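For completeness, the entry point at the bottom of the script is the standard Hydra pattern, something like this (I call the script directly from the shell; I am not sure whether the --multirun/-m flag changes anything here):

if __name__ == '__main__':
    # the decorated launcher reads the config and is expected to drive the sweep
    hydra_manual_proc_launcher()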