My code used to work in PyTorch 1.6. I recently upgraded to 1.9, and now when I try to train in distributed mode (on a single PC with 2 GPUs, not multiple machines), the following error occurs. Sorry for the long log; I have never seen this error before and am completely lost. (A simplified sketch of the relevant part of my training script is at the end of this post.)
$ python -m torch.distributed.run --standalone --nnodes=1 --nproc_per_node=2 train_dist_2.py
[INFO] 2021-08-13 18:21:14,035 run: Running torch.distributed.run with args: ['/usr/lib/python3.9/site-packages/torch/distributed/run.py', '--standalone', '--nnodes=1', '--nproc_per_node=2', 'train_dist_2.py']
[INFO] 2021-08-13 18:21:14,036 run:
Rendezvous info:
--rdzv_backend=c10d --rdzv_endpoint=localhost:29400 --rdzv_id=5c6a0ec7-2728-407d-8d25-7dde979518e6
[INFO] 2021-08-13 18:21:14,036 run: Using nproc_per_node=2.
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[INFO] 2021-08-13 18:21:14,036 api: Starting elastic_operator with launch configs:
entrypoint : train_dist_2.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 2
run_id : 5c6a0ec7-2728-407d-8d25-7dde979518e6
rdzv_backend : c10d
rdzv_endpoint : localhost:29400
rdzv_configs : {'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
[INFO] 2021-08-13 18:21:14,059 c10d_rendezvous_backend: Process 25097 hosts the TCP store for the C10d rendezvous backend.
[INFO] 2021-08-13 18:21:14,060 local_elastic_agent: log directory set to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog
[INFO] 2021-08-13 18:21:14,060 api: [default] starting workers for entrypoint: python
[INFO] 2021-08-13 18:21:14,060 api: [default] Rendezvous'ing worker group
[INFO] 2021-08-13 18:21:14,060 dynamic_rendezvous: The node 'cnn_25097_0' attempts to join the next round of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6'.
[INFO] 2021-08-13 18:21:14,277 dynamic_rendezvous: The node 'cnn_25097_0' has joined round 0 of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6' as rank 0 in a world of size 1.
/usr/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
[INFO] 2021-08-13 18:21:14,278 api: [default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=cnn
master_port=36965
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]
[INFO] 2021-08-13 18:21:14,278 api: [default] Starting worker group
[INFO] 2021-08-13 18:21:14,278 init: Setting worker0 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_0/0/error.json
[INFO] 2021-08-13 18:21:14,278 init: Setting worker1 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_0/1/error.json
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:21:23.793745 25104 ProcessGroupNCCL.cpp:480] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:21:23.793756 25154 ProcessGroupNCCL.cpp:580] [Rank 1] NCCL watchdog thread started!
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:21:23.801612 25103 ProcessGroupNCCL.cpp:480] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:21:23.801617 25157 ProcessGroupNCCL.cpp:580] [Rank 0] NCCL watchdog thread started!
Device check. Running model on cuda
Device check. Running model on cuda
Traceback (most recent call last):
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 285, in <module>
    main(args)
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 135, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids = [args.local_rank])
  File "/usr/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 20.9.9
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Traceback (most recent call last):
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 285, in <module>
    main(args)
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 135, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids = [args.local_rank])
  File "/usr/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 20.9.9
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
I20210813 18:21:24.402045 25154 ProcessGroupNCCL.cpp:582] [Rank 1] NCCL watchdog thread terminated normally
I20210813 18:21:24.403731 25157 ProcessGroupNCCL.cpp:582] [Rank 0] NCCL watchdog thread terminated normally
[ERROR] 2021-08-13 18:21:29,303 api: failed (exitcode: 1) local_rank: 0 (pid: 25103) of binary: /usr/bin/python
[ERROR] 2021-08-13 18:21:29,303 local_elastic_agent: [default] Worker group failed
[INFO] 2021-08-13 18:21:29,303 api: [default] Worker group FAILED. 3/3 attempts left; will restart worker group
[INFO] 2021-08-13 18:21:29,303 api: [default] Stopping worker group
[INFO] 2021-08-13 18:21:29,303 api: [default] Rendezvous'ing worker group
[INFO] 2021-08-13 18:21:29,303 dynamic_rendezvous: The node 'cnn_25097_0' attempts to join the next round of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6'.
[INFO] 2021-08-13 18:21:29,422 dynamic_rendezvous: The node 'cnn_25097_0' has joined round 1 of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6' as rank 0 in a world of size 1.
[INFO] 2021-08-13 18:21:29,423 api: [default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=cnn
master_port=53181
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]
[INFO] 2021-08-13 18:21:29,423 api: [default] Starting worker group
[INFO] 2021-08-13 18:21:29,423 init: Setting worker0 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_1/0/error.json
[INFO] 2021-08-13 18:21:29,423 init: Setting worker1 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_1/1/error.json
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:21:38.944895 25196 ProcessGroupNCCL.cpp:480] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:21:38.944903 25245 ProcessGroupNCCL.cpp:580] [Rank 1] NCCL watchdog thread started!
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:21:38.954780 25195 ProcessGroupNCCL.cpp:480] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:21:38.954794 25248 ProcessGroupNCCL.cpp:580] [Rank 0] NCCL watchdog thread started!
Device check. Running model on cuda
Device check. Running model on cuda
Traceback (most recent call last):
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 285, in <module>
    main(args)
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 135, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids = [args.local_rank])
  File "/usr/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 20.9.9
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Traceback (most recent call last):
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 285, in <module>
    main(args)
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 135, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids = [args.local_rank])
  File "/usr/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 20.9.9
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
I20210813 18:21:39.523105 25248 ProcessGroupNCCL.cpp:582] [Rank 0] NCCL watchdog thread terminated normally
I20210813 18:21:39.523314 25245 ProcessGroupNCCL.cpp:582] [Rank 1] NCCL watchdog thread terminated normally
[ERROR] 2021-08-13 18:21:44,447 api: failed (exitcode: 1) local_rank: 0 (pid: 25195) of binary: /usr/bin/python
[ERROR] 2021-08-13 18:21:44,447 local_elastic_agent: [default] Worker group failed
[INFO] 2021-08-13 18:21:44,447 api: [default] Worker group FAILED. 2/3 attempts left; will restart worker group
[INFO] 2021-08-13 18:21:44,447 api: [default] Stopping worker group
[INFO] 2021-08-13 18:21:44,447 api: [default] Rendezvous'ing worker group
[INFO] 2021-08-13 18:21:44,447 dynamic_rendezvous: The node 'cnn_25097_0' attempts to join the next round of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6'.
[INFO] 2021-08-13 18:21:44,448 dynamic_rendezvous: The node 'cnn_25097_0' has joined round 2 of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6' as rank 0 in a world of size 1.
[INFO] 2021-08-13 18:21:44,449 api: [default] Rendezvous complete for workers. Result:
restart_count=2
master_addr=cnn
master_port=35757
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]
[INFO] 2021-08-13 18:21:44,449 api: [default] Starting worker group
[INFO] 2021-08-13 18:21:44,449 init: Setting worker0 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_2/0/error.json
[INFO] 2021-08-13 18:21:44,449 init: Setting worker1 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_2/1/error.json
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:21:52.939616 25287 ProcessGroupNCCL.cpp:480] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:21:52.939630 25330 ProcessGroupNCCL.cpp:580] [Rank 1] NCCL watchdog thread started!
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:21:52.949156 25286 ProcessGroupNCCL.cpp:480] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:21:52.949156 25333 ProcessGroupNCCL.cpp:580] [Rank 0] NCCL watchdog thread started!
Device check. Running model on cuda
Device check. Running model on cuda
Traceback (most recent call last):
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 285, in <module>
    main(args)
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 135, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids = [args.local_rank])
  File "/usr/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 20.9.9
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Traceback (most recent call last):
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 285, in <module>
    main(args)
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 135, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids = [args.local_rank])
  File "/usr/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 20.9.9
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
I20210813 18:21:53.513854 25330 ProcessGroupNCCL.cpp:582] [Rank 1] NCCL watchdog thread terminated normally
I20210813 18:21:53.513996 25333 ProcessGroupNCCL.cpp:582] [Rank 0] NCCL watchdog thread terminated normally
[ERROR] 2021-08-13 18:21:54,472 api: failed (exitcode: 1) local_rank: 0 (pid: 25286) of binary: /usr/bin/python
[ERROR] 2021-08-13 18:21:54,472 local_elastic_agent: [default] Worker group failed
[INFO] 2021-08-13 18:21:54,472 api: [default] Worker group FAILED. 1/3 attempts left; will restart worker group
[INFO] 2021-08-13 18:21:54,472 api: [default] Stopping worker group
[INFO] 2021-08-13 18:21:54,472 api: [default] Rendezvous'ing worker group
[INFO] 2021-08-13 18:21:54,472 dynamic_rendezvous: The node 'cnn_25097_0' attempts to join the next round of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6'.
[INFO] 2021-08-13 18:21:54,473 dynamic_rendezvous: The node 'cnn_25097_0' has joined round 3 of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6' as rank 0 in a world of size 1.
[INFO] 2021-08-13 18:21:54,474 api: [default] Rendezvous complete for workers. Result:
restart_count=3
master_addr=cnn
master_port=44399
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]
[INFO] 2021-08-13 18:21:54,474 api: [default] Starting worker group
[INFO] 2021-08-13 18:21:54,474 init: Setting worker0 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_3/0/error.json
[INFO] 2021-08-13 18:21:54,474 init: Setting worker1 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_3/1/error.json
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:22:03.975812 25356 ProcessGroupNCCL.cpp:480] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:22:03.975847 25408 ProcessGroupNCCL.cpp:580] [Rank 1] NCCL watchdog thread started!
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:22:03.977819 25411 ProcessGroupNCCL.cpp:580] [Rank 0] NCCL watchdog thread started!
I20210813 18:22:03.977841 25355 ProcessGroupNCCL.cpp:480] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
Device check. Running model on cuda
Device check. Running model on cuda
Traceback (most recent call last):
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 285, in <module>
    main(args)
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 135, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids = [args.local_rank])
  File "/usr/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 20.9.9
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Traceback (most recent call last):
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 285, in <module>
    main(args)
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 135, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids = [args.local_rank])
  File "/usr/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 20.9.9
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
I20210813 18:22:04.551221 25408 ProcessGroupNCCL.cpp:582] [Rank 1] NCCL watchdog thread terminated normally
I20210813 18:22:04.554879 25411 ProcessGroupNCCL.cpp:582] [Rank 0] NCCL watchdog thread terminated normally
[ERROR] 2021-08-13 18:22:09,499 api: failed (exitcode: 1) local_rank: 0 (pid: 25355) of binary: /usr/bin/python
[ERROR] 2021-08-13 18:22:09,499 local_elastic_agent: [default] Worker group failed
[INFO] 2021-08-13 18:22:09,500 api: Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/usr/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
[INFO] 2021-08-13 18:22:09,500 api: Done waiting for other agents. Elapsed: 0.00022149085998535156 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "5c6a0ec7-2728-407d-8d25-7dde979518e6", "global_rank": 0, "group_rank": 0, "worker_id": "25355", "role": "default", "hostname": "cnn", "state": "FAILED", "total_run_time": 55, "rdzv_backend": "c10d", "raw_error": "{\"message\": \"\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [2]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "5c6a0ec7-2728-407d-8d25-7dde979518e6", "global_rank": 1, "group_rank": 0, "worker_id": "25356", "role": "default", "hostname": "cnn", "state": "FAILED", "total_run_time": 55, "rdzv_backend": "c10d", "raw_error": "{\"message\": \"\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [2]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "5c6a0ec7-2728-407d-8d25-7dde979518e6", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "cnn", "state": "SUCCEEDED", "total_run_time": 55, "rdzv_backend": "c10d", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 3}}
[INFO] 2021-08-13 18:22:09,501 dynamic_rendezvous: The node 'cnn_25097_0' has closed the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6'.
/usr/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning:
CHILD PROCESS FAILED WITH NO ERROR_FILE
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 25355 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:
from torch.distributed.elastic.multiprocessing.errors import record
@record
def trainer_main(args):
    # do train
warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.9/site-packages/torch/distributed/run.py", line 637, in <module>
    main()
  File "/usr/lib/python3.9/site-packages/torch/distributed/run.py", line 629, in main
    run(args)
  File "/usr/lib/python3.9/site-packages/torch/distributed/run.py", line 621, in run
    elastic_launch(
  File "/usr/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_dist_2.py FAILED
=======================================
Root Cause:
[0]:
time: 2021-08-13_18:22:09
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 25355)
error_file: <N/A>
msg: "Process failed with exitcode 1"
Other Failures:
[1]:
time: 2021-08-13_18:22:09
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 25356)
error_file: <N/A>
msg: "Process failed with exitcode 1"
Any help is appreciated.
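In case it helps, here is a stripped-down sketch of how the distributed part of train_dist_2.py is set up. Only the DistributedDataParallel call (line 135 in the traceback) appears as-is; the model, the data loading, and the argument parsing are simplified stand-ins written for this post, not the real code.

import argparse

import torch
import torch.distributed as dist
import torch.nn as nn


def main(args):
    # NCCL backend, matching the ProcessGroupNCCL lines in the log above
    dist.init_process_group(backend='nccl')

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print('Device check. Running model on', device)

    # placeholder model; the real script builds MobileFaceNet here
    model = nn.Linear(128, 10).to(device)

    # line 135: wrap the model for distributed training
    model = nn.parallel.DistributedDataParallel(model, device_ids = [args.local_rank])

    # ... training loop omitted ...


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # local rank of this process; under PyTorch 1.6 it was supplied as --local_rank by torch.distributed.launch
    parser.add_argument('--local_rank', type=int, default=0)
    args = parser.parse_args()
    main(args)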