Distributed training queue

mag1 · June 24, 2021, 10:37am

Hi,
I am wondering whether anyone has come across a solution for this, or whether there is some work in progress.

In most cases, a machine learning engineer runs an IPython notebook instance and launches experiments from it, including training new models.

It would be cool to be able to send the model and training loop, together with the data (just tensors), to a powerful remote GPU cluster that maintains a queue of jobs. These jobs could then run uninterrupted, with results reported through a web interface. The ability to download the trained models for local evaluation would also be great.
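
To make the idea more concrete, here is a rough single-machine sketch of the kind of interface I have in mind. Everything in it is an assumption for illustration: the names `TrainingQueue`, `submit`, `status` and `wait_all` are made up, a local worker thread stands in for the remote cluster, and a plain dict and a list of pairs stand in for the model and tensors. A real version would need to serialize the model and data and ship them over the network.

```python
import queue
import threading
import uuid


class TrainingQueue:
    """Toy stand-in for the remote cluster: runs submitted jobs one at a time."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._results = {}  # job_id -> {"status": ..., "result": ...}
        self._lock = threading.Lock()
        # A single daemon thread plays the role of the GPU worker.
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, train_fn, model, data):
        """Enqueue a job and return an id that can be used to poll it."""
        job_id = uuid.uuid4().hex
        self._set(job_id, "queued")
        self._jobs.put((job_id, train_fn, model, data))
        return job_id

    def status(self, job_id):
        """Return a copy of the job record (what a web UI would display)."""
        with self._lock:
            return dict(self._results[job_id])

    def wait_all(self):
        """Block until every queued job has finished."""
        self._jobs.join()

    def _set(self, job_id, status, result=None):
        with self._lock:
            self._results[job_id] = {"status": status, "result": result}

    def _run(self):
        while True:
            job_id, train_fn, model, data = self._jobs.get()
            self._set(job_id, "running")
            try:
                self._set(job_id, "done", train_fn(model, data))
            except Exception as exc:  # one failed job must not stop the queue
                self._set(job_id, "failed", repr(exc))
            finally:
                self._jobs.task_done()


def train(model, data):
    # Hypothetical training loop: fit w in y = w * x by gradient descent.
    w = model["w"]
    for _ in range(200):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= 0.05 * grad
    return {"w": w}


jobs = TrainingQueue()
job_id = jobs.submit(train, {"w": 0.0}, [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
jobs.wait_all()
trained = jobs.status(job_id)  # the "downloaded" model for local evaluation
```

From the notebook side, the whole workflow would then be `submit`, poll `status`, and fetch the result once the job is done, while the queue takes care of running jobs in order.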