Hi,
I tried two methods to start distributed training:
method 1:
call torch.multiprocessing.spawn to start n processes on one machine with multiple GPUs
method 2:
call torch.distributed.launch to start n processes on one machine with multiple GPUs
With method 1, when I press Ctrl+C to stop the code, the subprocesses do not stop.
With method 2, when I press Ctrl+C to stop the code, the subprocesses stop.
My questions are:
- For method 1, how can I stop the subprocesses from Python code?
- For method 1, can the start code be run from Python code?
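For the first question, one possible approach is sketched below. It assumes that torch.multiprocessing.spawn is called with join=False, so that it returns a context object whose processes can be terminated when Ctrl+C arrives in the parent. The names worker, stop_all and main are hypothetical, and the worker body is only a placeholder for the real training loop.

```python
import sys
import time


def worker(rank, world_size):
    # Hypothetical placeholder for the real per-GPU training loop.
    while True:
        time.sleep(1)


def stop_all(processes, timeout=5.0):
    """Terminate every process that is still alive, then wait for each one."""
    for p in processes:
        if p.is_alive():
            p.terminate()
    for p in processes:
        p.join(timeout)


def main(world_size):
    # Imported here so the helpers above can be used without PyTorch installed.
    import torch.multiprocessing as mp

    # join=False returns a context object instead of blocking until the end.
    ctx = mp.spawn(worker, args=(world_size,), nprocs=world_size, join=False)
    try:
        # ctx.join returns True once all processes have finished.
        while not ctx.join(timeout=1.0):
            pass
    except KeyboardInterrupt:
        # Ctrl+C in the parent: stop the children explicitly.
        stop_all(ctx.processes)
        sys.exit(130)
```

To try it, call main(world_size=2) under an `if __name__ == "__main__":` guard, which the spawn start method needs. Whether this is enough depends on what the workers are doing; a worker blocked in a collective call may need a longer timeout or a kill instead of terminate.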
The launch script I use for method 2:
#!/bin/bash
NUM_PROC=$1
shift
python3 -m torch.distributed.launch --master_port=44145 --nproc_per_node=$NUM_PROC train.py "$@"
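If the second question means running the launcher command from the shell script above inside Python, a minimal sketch could use subprocess. The port and the train.py script name are taken from the script above; build_launch_cmd and run_launcher are hypothetical helper names, and the Ctrl+C handling is an assumption about the desired behavior.

```python
import subprocess
import sys


def build_launch_cmd(num_proc, script="train.py", script_args=(), master_port=44145):
    """Build the same command line as the shell script, as an argument list."""
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--master_port={master_port}",
        f"--nproc_per_node={num_proc}",
        script, *script_args,
    ]


def run_launcher(num_proc, script_args=()):
    """Start the launcher and wait for it; on Ctrl+C, ask it to stop."""
    proc = subprocess.Popen(build_launch_cmd(num_proc, script_args=script_args))
    try:
        return proc.wait()
    except KeyboardInterrupt:
        # Forward the stop request to the launcher, then wait for it to exit.
        proc.terminate()
        return proc.wait()
```

For example, run_launcher(4, ["--lr", "0.1"]) would correspond to calling the script with 4 as NUM_PROC and the remaining arguments passed through to train.py (the --lr flag here is only an illustration).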