Running independent processes is a bad idea (IMHO) if you have a large pre-processed dataset: you will quickly run out of memory. Alternatively, if you load samples from disk, you are slowed down by disk I/O (and by unnecessarily redoing the preprocessing of the raw samples). Isn’t there a way to train multiple independent copies of the same (differently initialized) model in parallel, such that each copy reads from exactly the same location in GPU memory (or, if that’s not possible, CPU memory) holding the dataset/dataloader object? I asked this question in a similar thread here
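
To make the idea concrete, below is a rough, minimal sketch of the CPU-shared-memory version of what I mean (the tensor shapes, the toy `nn.Linear` model, and `train_copy` are just placeholders, not working code I already have): the preprocessed dataset lives once in shared memory, and each spawned worker trains its own differently seeded model copy while reading batches from that same shared storage.

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn

def train_copy(rank, data, targets):
    # Each worker builds a differently initialized copy of the same model,
    # but `data` and `targets` point at the same shared CPU memory.
    torch.manual_seed(rank)
    model = nn.Linear(32, 2)                 # placeholder architecture
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for step in range(100):
        idx = torch.randint(0, data.size(0), (256,))
        loss = loss_fn(model(data[idx]), targets[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()

if __name__ == "__main__":
    # One copy of the preprocessed dataset, placed in shared memory once.
    data = torch.randn(10_000, 32).share_memory_()            # placeholder features
    targets = torch.randint(0, 2, (10_000,)).share_memory_()  # placeholder labels
    mp.spawn(train_copy, args=(data, targets), nprocs=4)
```

What I’d really like is the same thing but with the dataset resident on the GPU, so the copies don’t each pay for their own copy of the data, and I’m not sure whether that is possible across processes.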