Distributed training with CPUs

Here is how to launch the code from a Jupyter notebook.

import os
# hide the GPUs so that training runs on CPU only
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import time
import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    # stagger the prints so each process's output stays readable
    time.sleep(local_rank)
    print(idist.get_rank(), ': run with config:', config, '- backend=', idist.backend())
    # do the training ...

backend = 'gloo'  # CPU-friendly backend
dist_configs = {'nproc_per_node': 4, "start_method": "fork"}
config = {'c': 12345}

with idist.Parallel(backend=backend, **dist_configs) as parallel:
    parallel.run(training, config, a=1, b=2)

In a notebook you have to use start_method="fork", because functions defined interactively cannot be pickled for the default "spawn" method.

If you would like to run it as a script file and spawn the processes from your main.py as you currently do, then you can use the default start_method. It can also help to set persistent_workers=True on the DataLoader to speed up data fetching at the start of every epoch…
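For example, a minimal main.py along those lines could look like the sketch below (the dataset, batch size, number of workers and epoch count are placeholders I made up, not something from your setup):

# main.py - minimal sketch; dataset, batch size, num_workers and num_epochs are placeholders
import torch
from torch.utils.data import DataLoader, TensorDataset
import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    dataset = TensorDataset(torch.randn(1000, 10))
    # persistent_workers=True keeps the worker processes alive between epochs,
    # so they are not re-created at the start of every epoch
    loader = DataLoader(dataset, batch_size=config["batch_size"],
                        num_workers=2, persistent_workers=True)
    for epoch in range(config["num_epochs"]):
        for batch in loader:
            pass  # do the training ...

if __name__ == "__main__":
    # the default start_method ("spawn") works fine when launching from a script
    with idist.Parallel(backend="gloo", nproc_per_node=4) as parallel:
        parallel.run(training, {"batch_size": 32, "num_epochs": 2})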
If you would like to use a script file and spawn the processes with torch.distributed.launch, you can simply reuse the command from my previous message (and there is no need to set persistent_workers=True).
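In that case the script itself stays almost the same, just without nproc_per_node, since the launcher already creates the processes (sketch below; the launch flags in the comment are the usual ones, not necessarily your exact command):

# main.py, when the processes are created by torch.distributed.launch,
# typically with something like:
#   python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py
import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    print(idist.get_rank(), ': run with config:', config, '- backend=', idist.backend())
    # do the training ...

if __name__ == "__main__":
    # no nproc_per_node here: the launcher has already spawned the processes
    with idist.Parallel(backend="gloo") as parallel:
        parallel.run(training, {"c": 12345})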