Suppose I have a single node with 4 GPUs. I would like to do model selection on a specific dataset through random search. A manager keeps track of a grid of hyperparameters. At each iteration, a distinct model with a specific setting is created and trained on an assigned GPU. The training dataset is shared by all such models. A minimal example would look like the following:
import multiprocessing
from typing import Dict, Iterator, List

import torch
import torch.nn as nn


class HyperSearchManager:
    def __init__(self,
                 train_dataset: torch.utils.data.Dataset,
                 valid_dataset: torch.utils.data.Dataset,
                 test_dataset: torch.utils.data.Dataset,
                 param_grid: Dict[str, List]):
        self.train_dataset = train_dataset
        self.valid_dataset = valid_dataset
        self.test_dataset = test_dataset
        self.param_grid = param_grid
        self.best = float('inf')
        self.optimal_model = None

    def param_iter(self) -> Iterator[Dict]:
        ...
        yield params

    def train_single_model(self, model: nn.Module, num_epoch: int, device: torch.device):
        # copy model to the respective device
        model = model.to(device)
        # train loops for a single model
        loader = torch.utils.data.DataLoader(self.train_dataset, batch_size, ...)
        optimizer = torch.optim.Adam(model.parameters(), lr, ...)
        for epoch in range(num_epoch):
            for data in loader:
                data = data.to(device)
                train(model, data, optimizer)
                ...
            # Do validation with early stopping, etc.
            valid_loss = validation(model, self.valid_dataset)
        # update optimal model according to valid metrics
        self.update(model.cpu(), valid_loss)

    def update(self, model, valid_loss):
        # if valid_loss is minimal, keep current model
        # NOTE: when called inside a subprocess, this only changes the child's copy of the manager
        if valid_loss < self.best:
            self.best = valid_loss
            self.optimal_model = model

    def search(self):
        param_iter = self.param_iter()  # create the generator once, then draw from it
        for _ in range(MAX_HYPER_OPT_ITER):
            params = next(param_iter)  # get next hyperparam combination
            model = ModuleClass(**params)  # create model for the specific hyperparam
            # if a free gpu is available, create a new subprocess to run the model on the allocated gpu
            device = self.get_available_device()
            proc = multiprocessing.Process(target=self.train_single_model, args=(model, num_epochs, device))
            proc.start()
            # else waiting...
        run_test(self.optimal_model, self.test_dataset)
I wonder whether it is possible to schedule pending models onto idle GPUs. That is, at first 4 models are trained on the 4 GPUs, one per GPU. Once a training process finishes, a new model is assigned to the released GPU.
If that’s not straightforward, is there an easy existing implementation of this functionality?
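To make the intended schedule concrete, here is a minimal sketch of one way it could work, not a definitive implementation. Free device ids sit in a queue; each worker takes one, trains on it, and puts it back so the next pending model can start. All names here (`train_and_validate`, `run_on_free_device`, `search`, `NUM_GPUS`) are hypothetical, and `train_and_validate` is a stand-in that returns a fake loss instead of running real PyTorch training. The sketch uses threads from the standard library to stay short; I assume the same pattern carries over to worker processes if the device queue is one created by `multiprocessing.Manager()`.

```python
import queue
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

NUM_GPUS = 4  # assumption: one node with 4 GPUs, as in the question


def train_and_validate(params, device_id):
    """Hypothetical stand-in for train_single_model.

    Real code would build the model from `params`, move it to
    torch.device(f"cuda:{device_id}"), train it, and return the
    validation loss. Here the loss is a fake value computed from params.
    """
    return abs(params["lr"] - 0.01) + 0.001 * params["hidden"]


def run_on_free_device(params, free_devices):
    """Wait for a free device id, train on it, then release it."""
    device_id = free_devices.get()  # blocks while every device is busy
    try:
        loss = train_and_validate(params, device_id)
        return params, device_id, loss
    finally:
        free_devices.put(device_id)  # hand the device to the next pending model


def search(param_list, num_gpus=NUM_GPUS):
    """Run every setting, at most one per device at a time; return the best."""
    free_devices = queue.Queue()
    for device_id in range(num_gpus):
        free_devices.put(device_id)

    best_params, best_loss = None, float("inf")
    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        futures = [pool.submit(run_on_free_device, p, free_devices)
                   for p in param_list]
        for future in as_completed(futures):
            params, _device_id, loss = future.result()
            if loss < best_loss:  # bookkeeping stays in the parent
                best_params, best_loss = params, loss
    return best_params, best_loss


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = [{"lr": rng.choice([0.1, 0.01, 0.001]),
                   "hidden": rng.choice([16, 32, 64])}
                  for _ in range(10)]
    print(search(candidates))
```

Because each worker returns its validation loss to the caller, the best setting is tracked in the parent rather than inside the worker, which avoids relying on state changed in a subprocess.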