I would like to parallelize fold training in n-fold cross-validation by distributing the folds across k different GPUs, i.e. data parallelism at the fold level.
I have read some tutorials on module-level parallelism, and am wondering whether there is a similar tutorial for parallelism at the fold level.
I have sketched some pseudocode below, but I am unsure how to implement it. I have tried Python's concurrent.futures library, but it does not seem to work across multiple GPUs.
```python
for idx_fold, (index_train, index_test) in enumerate(kf.split(data)):
    if idx_fold % 2 == 0:
        # train on the 1st GPU; when the 1st GPU finishes this fold,
        # it moves on to the next fold allocated to it
        ...
    else:
        # train on the 2nd GPU; when the 2nd GPU finishes this fold,
        # it moves on to the next fold allocated to it
        ...
```
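One way to sketch this (an assumption on my part, not a tested recipe) is to keep a queue of free GPU ids and let a small thread pool pull folds off the fold list: each worker grabs a GPU, trains one fold on it, and returns the GPU to the queue for the next fold. The names `run_folds_on_gpus` and `train_fold` are hypothetical; `train_fold` stands for your own training function, which would move the model to the assigned device (e.g. `model.to(f"cuda:{device}")` in PyTorch):

```python
import queue
from concurrent.futures import ThreadPoolExecutor

def run_folds_on_gpus(folds, train_fold, device_ids):
    """Train each (train_idx, test_idx) fold via train_fold(fold_idx,
    train_idx, test_idx, device), with at most one fold per GPU at a time."""
    free_gpus = queue.Queue()
    for d in device_ids:
        free_gpus.put(d)

    def worker(args):
        fold_idx, (train_idx, test_idx) = args
        device = free_gpus.get()          # block until some GPU is free
        try:
            # train_fold is your code; it should place the model/data
            # on this device before training
            return train_fold(fold_idx, train_idx, test_idx, device)
        finally:
            free_gpus.put(device)         # hand the GPU to the next fold

    # one worker thread per GPU, so k folds run concurrently
    with ThreadPoolExecutor(max_workers=len(device_ids)) as pool:
        return list(pool.map(worker, enumerate(folds)))
```

Threads can be enough here because the heavy compute runs on the GPU, not in Python; if you need stronger isolation (separate CUDA contexts, no GIL contention in data loading), the same pattern works with one process per GPU, e.g. pinning each worker process to a device via `CUDA_VISIBLE_DEVICES`.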
In addition, it would be great if the distributed training across the k GPUs could be regarded as a single job on a Slurm GPU server (as seen by the admin).
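On the Slurm side, I believe this falls out naturally if a single Python process drives all the GPUs (as in the dispatch loop above): you request all k GPUs in one allocation, so the admin sees one job. A minimal sbatch sketch, assuming k = 2 and a hypothetical script name `train_cv.py`:

```shell
#!/bin/bash
#SBATCH --job-name=cv-folds
#SBATCH --gres=gpu:2          # both GPUs in a single allocation
#SBATCH --ntasks=1            # one task: one Python process drives both GPUs
#SBATCH --cpus-per-task=8     # adjust for your data-loading workers

# Slurm sees a single job; the script schedules folds across the GPUs itself
python train_cv.py
```

The resource numbers here are placeholders to adapt to your cluster's limits.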
Thanks a lot for your help!