Is it possible to run hyperparameter optimization on multiple GPUs in parallel?

Suppose I have a single node with 4 GPUs. I would like to do model selection for a specific dataset through random search. A manager keeps track of a grid of hyperparameters. At each iteration, a distinct model with a specific setting is created and trained on an assigned GPU. The training dataset is shared by all such models. A minimal example looks like the following:

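import multiprocessing
import random
from typing import Dict, Iterator, List

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

class HyperSearchManager:
    def __init__(self,
                 train_dataset,
                 valid_dataset,
                 test_dataset,
                 param_grid: Dict[str, List]):
        self.train_dataset = train_dataset
        self.valid_dataset = valid_dataset
        self.test_dataset = test_dataset
        self.param_grid = param_grid
        self.optimal_valid_loss = float('inf')  # best validation loss seen so far
        self.optimal_model = None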

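    def param_iter(self) -> Iterator[Dict]:
        # random search: sample one value per hyperparameter, indefinitely
        while True:
            yield {name: random.choice(values) for name, values in self.param_grid.items()}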

    def train_single_model(self, model: nn.Module, num_epoch: int, device: torch.device):
        # copy model to the respective device
        model = model.to(device)
        # train loops for a single model
        loader = DataLoader(self.train_dataset, batch_size, ...)
        optimizer = torch.optim.Adam(model.parameters(), lr, ...)
        for epoch in range(num_epoch):
            for data in loader:
                data = data.to(device)
                train(model, data, optimizer)
            # Do validation with early stopping, etc.
            valid_loss = validation(model, self.valid_dataset)
        # update optimal model according to valid metrics
        self.update(model.cpu(), valid_loss)

    def update(self, model, valid_loss):
        # if valid_loss is minimal, keep current model
        if valid_loss < self.optimal_valid_loss:
            self.optimal_valid_loss = valid_loss
            self.optimal_model = model

    def search(self):
        param_iter = self.param_iter()  # create the generator once, then draw from it
        for _ in range(MAX_HYPER_OPT_ITER):
            params = next(param_iter) # get next hyperparam combination
            model = ModuleClass(**params) # create model for the specific hyperparam

            # if a free gpu is available, create a new subprocess to run the model on the allocated gpu
            device = self.get_available_device()
            proc = multiprocessing.Process(target=self.train_single_model, args=(model, num_epochs, device))
            proc.start()
            # else waiting...

        run_test(self.optimal_model, self.test_dataset)

I wonder if it is possible to schedule pending models onto idle GPUs. That is, at first 4 models are trained on the 4 GPUs, one per GPU. Once a training process finishes, a new model is assigned to the released GPU.

If that’s not straightforward, is there an existing implementation that provides this functionality?

As far as I know, PyTorch doesn’t provide a framework to do this automatically. You will have to build this scheduling mechanism in your application itself.
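
For what it's worth, here is a minimal sketch of what such an application-level scheduler could look like, using only the Python standard library. It is one possible approach, not an official PyTorch recipe. The names `run_search`, `sample_configs` and `fake_train` are hypothetical; `fake_train` is a stand-in for a real training function that would build the model, move it to `cuda:<device_id>`, train it, and return the validation loss. The loss is returned to the parent process because attributes set on an object inside a child process are not visible in the parent. The sketch also assumes the `spawn` start method, since CUDA generally cannot be re-initialized in a forked subprocess.

```python
import multiprocessing
import random
from concurrent.futures import (FIRST_COMPLETED, Executor,
                                ProcessPoolExecutor, ThreadPoolExecutor, wait)
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple

Params = Dict[str, Any]


def sample_configs(param_grid: Dict[str, List], n: int, seed: int = 0) -> List[Params]:
    """Random search: draw n configurations, one random value per hyperparameter."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in param_grid.items()} for _ in range(n)]


def fake_train(params: Params, device_id: int) -> float:
    """Hypothetical placeholder for real training; returns a validation loss."""
    return (params["lr"] - 0.01) ** 2


def run_search(configs: Iterable[Params],
               device_ids: List[int],
               train_fn: Callable[[Params, int], float],
               executor: Optional[Executor] = None) -> List[Tuple[float, Params, int]]:
    """Run train_fn(params, device_id) for every config, at most one job per device.

    Returns a list of (valid_loss, params, device_id) in completion order.
    """
    free = list(device_ids)      # devices with no job currently running
    pending = iter(configs)      # configurations not yet submitted
    running = {}                 # future -> (params, device_id)
    results = []
    own_executor = executor is None
    if own_executor:
        # Assumption: one worker process per GPU, started with "spawn".
        executor = ProcessPoolExecutor(
            max_workers=len(free),
            mp_context=multiprocessing.get_context("spawn"))
    try:
        exhausted = False
        while True:
            # Hand a pending config to every idle device.
            while free and not exhausted:
                try:
                    params = next(pending)
                except StopIteration:
                    exhausted = True
                    break
                device_id = free.pop()
                future = executor.submit(train_fn, params, device_id)
                running[future] = (params, device_id)
            if not running:
                break
            # Block until at least one job finishes, then release its device.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                params, device_id = running.pop(future)
                free.append(device_id)
                results.append((future.result(), params, device_id))
    finally:
        if own_executor:
            executor.shutdown()
    return results


if __name__ == "__main__":
    # CPU-only dry run with threads; omit `executor` to use one process per GPU.
    grid = {"lr": [0.1, 0.01, 0.001]}
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = run_search(sample_configs(grid, 8), [0, 1, 2, 3],
                             fake_train, executor=pool)
    print(min(results, key=lambda r: r[0]))
```

With real training you would replace `fake_train` with a module-level function (so it can be pickled for the worker processes) and call `run_search(configs, [0, 1, 2, 3], train_fn)`. The best configuration is then the entry with the smallest loss, and the corresponding model can be retrained or reloaded from a checkpoint saved by the training function.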