Workflow and scheduling of training runs


I have a model that I want to train with several different hyperparameter settings, and I want to compare the runs afterwards.

Currently I start each training run manually with different hyperparameters, which is slow and unstructured.

I need to find a better way of working, so I started looking for tools that could help me. Currently I am considering MLflow for documenting runs and saving models. It seems easy to add to my training scripts and will give me structured results.

However, it does not solve the problem of starting multiple training runs automatically. Right now I can run about 10 runs in parallel on my current hardware, but I will need to move to the cloud once the models get larger. I have a tmux session with multiple splits and just start the script in each pane manually (not very efficient).

Does anyone have experience with a good tool for deploying training runs and scheduling them when hardware is limited? I would like to just put my training runs in a queue and let some framework handle starting and running them. For example, let's say I have a Docker image with my code (or a conda environment or something similar) where all I need is a config file for the training. If I have 10 different config files and want to start a training run for each one, what is the best way to go about this?
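To make the queue idea concrete, here is a minimal sketch of what I could hack together myself in Python. It is only an illustration of the behaviour I am after, not a real solution: `train.py` and its `--config` flag are placeholder names for my training entry point, and `run_queue`, `run_one` and `build_command` are made-up helper names. It only handles local runs, with no retries, no GPU assignment and no cloud support, which is exactly why I am looking for a proper tool.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_command(config_path):
    # Placeholder entry point (assumption): swap in the real command,
    # for example a `docker run ...` call that mounts the config file.
    return [sys.executable, "train.py", "--config", str(config_path)]


def run_one(config_path, make_command=build_command):
    # Run a single training job to completion and report its exit code.
    result = subprocess.run(make_command(config_path))
    return str(config_path), result.returncode


def run_queue(config_paths, max_parallel=10, make_command=build_command):
    # The pool keeps at most `max_parallel` jobs running; the remaining
    # configs wait in the executor's internal queue until a slot frees up.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = pool.map(lambda p: run_one(p, make_command), config_paths)
        return dict(results)
```

Calling `run_queue(list_of_config_paths, max_parallel=10)` would return a dictionary that maps each config path to the exit code of its run, so failed runs are at least visible afterwards.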

Is there a tool that works both locally and in the cloud? Or does anyone have recommendations for where I should start looking?

It would also be lovely if the tool could be used with Spotinst on AWS, or with a similar service from another cloud provider.