Parallelize n-fold validation across k different GPUs in a single process

Hi everyone,

I would like to parallelize the fold training in n-fold cross-validation by splitting the folds across k different GPUs, i.e. data parallelism at the fold level.

I have read some tutorials on module-level parallelism, and I am wondering if there is a similar tutorial for parallelism at the fold level.

I have written some pseudo-code below, but I am unsure how to implement it. I have tried Python's concurrent.futures library; however, it does not seem to work across multiple GPUs.

for idx_fold, (index_train, index_test) in enumerate(kf.split(data)):
    if idx_fold % 2 == 0:
        ...  # train this fold on the 1st GPU; once the 1st GPU finishes,
             # it should move on to the next fold allocated to it
    else:
        ...  # train this fold on the 2nd GPU; once the 2nd GPU finishes,
             # it should move on to the next fold allocated to it
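
To make this concrete, below is a sketch of the kind of solution I have in mind, using torch.multiprocessing with one worker process per GPU that pulls folds from a shared queue, so an idle GPU always picks up the next pending fold. The model construction and training loop are placeholders (MyModel and the returned metrics are made up), and data and kf are the same objects as in the pseudo-code above. Please correct me if this is the wrong approach:

import torch
import torch.multiprocessing as mp
from sklearn.model_selection import KFold

def gpu_worker(gpu_id, fold_queue, result_queue):
    # Each worker owns exactly one GPU and trains the folds it pulls from the
    # queue, so no two folds ever run on the same GPU at the same time.
    device = torch.device(f"cuda:{gpu_id}")
    while True:
        item = fold_queue.get()
        if item is None:  # sentinel: no folds left, shut down this worker
            break
        idx_fold, index_train, index_test = item
        # model = MyModel().to(device)  # placeholder: fresh model per fold
        # ... usual training loop on data[index_train],
        #     evaluation on data[index_test] ...
        result_queue.put((idx_fold, None))  # placeholder for fold metrics

if __name__ == "__main__":
    mp.set_start_method("spawn")  # CUDA requires 'spawn' in child processes
    n_gpus = torch.cuda.device_count()  # this is k
    kf = KFold(n_splits=5)

    fold_queue, result_queue = mp.Queue(), mp.Queue()
    for idx_fold, (index_train, index_test) in enumerate(kf.split(data)):
        fold_queue.put((idx_fold, index_train, index_test))
    for _ in range(n_gpus):  # one sentinel per worker
        fold_queue.put(None)

    workers = [mp.Process(target=gpu_worker, args=(g, fold_queue, result_queue))
               for g in range(n_gpus)]
    for w in workers:
        w.start()
    results = [result_queue.get() for _ in range(kf.get_n_splits())]
    for w in workers:
        w.join()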

In addition, it would be great if all the training distributed over the k GPUs could be regarded as a single job on a Slurm GPU server (as seen by the admin).
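
For what it's worth, my understanding is that because the k worker processes in the sketch above are all children of a single launcher script, Slurm should account for them as one job as long as the script is submitted within a single allocation requesting k GPUs, but I would appreciate confirmation on that.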

Thanks a lot for your help! :grinning: :grinning: