Here is how to launch the code from a Jupyter notebook:
```python
import os

os.environ["CUDA_VISIBLE_DEVICES"] = ""

import time

import ignite.distributed as idist


def training(local_rank, config, **kwargs):
    time.sleep(local_rank)
    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())
    # do the training ...


backend = "gloo"
dist_configs = {"nproc_per_node": 4, "start_method": "fork"}
config = {"c": 12345}

with idist.Parallel(backend=backend, **dist_configs) as parallel:
    parallel.run(training, config, a=1, b=2)
```
You have to use `start_method="fork"` in the notebook, because the default `"spawn"` method requires the worker function to be importable from a module on disk, which a function defined interactively in a notebook cell is not.
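To illustrate why `"fork"` is the notebook-friendly choice, here is a minimal sketch using plain `multiprocessing` (not ignite itself); the `training` stub and its config are made up for the example, but the start-method mechanics are the same ones `idist.Parallel` relies on:

```python
import multiprocessing as mp


def training(local_rank, config):
    # Forked workers inherit this function from the parent process's memory,
    # so it does not need to be importable from a script file on disk.
    return local_rank, config["c"]


def run(nproc=4):
    ctx = mp.get_context("fork")  # "fork": workers are copies of the parent
    with ctx.Pool(nproc) as pool:
        # starmap preserves argument order, so results come back rank-sorted
        return pool.starmap(training, [(r, {"c": 12345}) for r in range(nproc)])


if __name__ == "__main__":
    print(run())
```

With `"spawn"` instead of `"fork"`, the same code would fail inside a notebook because each fresh interpreter would try (and fail) to re-import `training` from `__main__`.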
If you would like to run it as a script file and spawn the processes from your main.py script as you do now, then you can use the default start_method. It can also help to set `persistent_workers=True` on the `DataLoader` to speed up data fetching at every epoch.
If you would like to use a script file and spawn the processes with `torch.distributed.launch`, you can simply reuse the command from my previous message (and there is no need to set `persistent_workers=True`).
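For reference, a typical `torch.distributed.launch` invocation has this shape (the script name `main.py` and the process count are placeholders to adapt to your setup; on recent PyTorch versions `torchrun` supersedes this module):

```shell
python -m torch.distributed.launch --nproc_per_node=4 main.py
```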