NCCL timed out when using the torch.distributed.run

Hi, when I use DDP to train my model, after 1 epoch I get the following error message:
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803203 milliseconds before timing out.

Can anybody help me in solving that problem?

Could you rerun your script via:

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_DISTRIBUTED_DEBUG=INFO

and see if you would receive any warnings or errors?
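For example, a full invocation might look like this (a sketch: I'm assuming a single-node run with 4 GPUs based on your logs, so adjust --nproc_per_node to your setup):

```shell
# Enable verbose NCCL and torch.distributed logging, then relaunch the script
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_DISTRIBUTED_DEBUG=INFO
python3 -m torch.distributed.run --nproc_per_node=4 scratch_multitask_Imagenet.py
```

The extra logging often surfaces the real failure (e.g. a crash on one rank) that otherwise only shows up as a collective timeout on the surviving ranks.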

Hi, after setting those variables I got the following output:

File "/home/nikiguo93/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/home/nikiguo93/.local/lib/python3.9/site-packages/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
IsADirectoryError: Caught IsADirectoryError in DataLoader worker process 1.
Original Traceback (most recent call last):
File "/home/nikiguo93/.local/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/home/nikiguo93/.local/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/nikiguo93/.local/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/nikiguo93/scratch_multitask_Imagenet.py", line 109, in __getitem__
TheImage = Image.open(self.ImageNames[idx]).convert("RGB")
File "/usr/lib/python3/dist-packages/PIL/Image.py", line 2904, in open
fp = builtins.open(filename, "rb")
IsADirectoryError: [Errno 21] Is a directory: '/local_scratch/nikiguo93/train/hpc-work/Codebase'

torch.Size([300, 17, 100])
for train: torch.Size([300, 17, 100])
for train: torch.Size([300, 17, 100])
for train: tensor(0.4474, device='cuda:3', grad_fn=) tensor(6.9942, device='cuda:3', grad_fn=)
tensor(42.7889, device='cuda:3', grad_fn=)
smicro01:681176:681927 [3] NCCL INFO AllReduce: opCount 485 sendbuff 0x7f5c10000000 recvbuff 0x7f5c10000000 count 11037324 datatype 6 op 0 root 0 comm 0x7f5f3c002e10 [nranks=4] stream 0x8070110
smicro01:681176:681927 [3] NCCL INFO AllReduce: opCount 486 sendbuff 0x7f5ab8b00000 recvbuff 0x7f5ab8b00000 count 10289152 datatype 6 op 0 root 0 comm 0x7f5f3c002e10 [nranks=4] stream 0x8070110
tensor(0.4443, device='cuda:0', grad_fn=) tensor(7.0288, device='cuda:0', grad_fn=)
tensor(0.4410, device='cuda:1', grad_fn=) tensor(42.5709, device='cuda:0', grad_fn=)
tensor(7.0794, device='cuda:1', grad_fn=)
tensor(42.3626, device='cuda:1', grad_fn=)
smicro01:681173:681928 [0] NCCL INFO AllReduce: opCount 485 sendbuff 0x7fc842000000 recvbuff 0x7fc842000000 count 11037324 datatype 6 op 0 root 0 comm 0x7fcb64002e10 [nranks=4] stream 0x8d29010
smicro01:681174:681919 [1] NCCL INFO AllReduce: opCount 485 sendbuff 0x7f1418000000 recvbuff 0x7f1418000000 count 11037324 datatype 6 op 0 root 0 comm 0x7f1734002e10 [nranks=4] stream 0x699a060
smicro01:681173:681928 [0] NCCL INFO AllReduce: opCount 486 sendbuff 0x7fc6e4b00000 recvbuff 0x7fc6e4b00000 count 10289152 datatype 6 op 0 root 0 comm 0x7fcb64002e10 [nranks=4] stream 0x8d29010
smicro01:681174:681919 [1] NCCL INFO AllReduce: opCount 486 sendbuff 0x7f12c0b00000 recvbuff 0x7f12c0b00000 count 10289152 datatype 6 op 0 root 0 comm 0x7f1734002e10 [nranks=4] stream 0x699a060
smicro01:681176:681927 [3] NCCL INFO AllReduce: opCount 487 sendbuff 0x7f5aa8000000 recvbuff 0x7f5aa8000000 count 7660736 datatype 6 op 0 root 0 comm 0x7f5f3c002e10 [nranks=4] stream 0x8070110
smicro01:681176:681927 [3] NCCL INFO AllReduce: opCount 488 sendbuff 0x7f5a27648000 recvbuff 0x7f5a27648000 count 53120 datatype 7 op 0 root 0 comm 0x7f5f3c002e10 [nranks=4] stream 0x8070110
smicro01:681173:681928 [0] NCCL INFO AllReduce: opCount 487 sendbuff 0x7fc6d4000000 recvbuff 0x7fc6d4000000 count 7660736 datatype 6 op 0 root 0 comm 0x7fcb64002e10 [nranks=4] stream 0x8d29010
smicro01:681174:681919 [1] NCCL INFO AllReduce: opCount 487 sendbuff 0x7f12b0000000 recvbuff 0x7f12b0000000 count 7660736 datatype 6 op 0 root 0 comm 0x7f1734002e10 [nranks=4] stream 0x699a060
smicro01:681173:681928 [0] NCCL INFO AllReduce: opCount 488 sendbuff 0x7fc653648000 recvbuff 0x7fc653648000 count 53120 datatype 7 op 0 root 0 comm 0x7fcb64002e10 [nranks=4] stream 0x8d29010
smicro01:681174:681919 [1] NCCL INFO AllReduce: opCount 488 sendbuff 0x7f122f648000 recvbuff 0x7f122f648000 count 53120 datatype 7 op 0 root 0 comm 0x7f1734002e10 [nranks=4] stream 0x699a060
[ERROR] 2022-06-04 17:20:41,039 api: failed (exitcode: 1) local_rank: 2 (pid: 681175) of binary: /usr/bin/python3
[ERROR] 2022-06-04 17:20:41,039 local_elastic_agent: [default] Worker group failed
[INFO] 2022-06-04 17:20:41,040 api: Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/home/nikiguo93/.local/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
[INFO] 2022-06-04 17:20:41,041 api: Done waiting for other agents. Elapsed: 0.0005986690521240234 seconds
{"name": "torchelastic.worker.status.TERMINATED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "681173", "role": "default", "hostname": "smicro01.physik.fu-berlin.de", "state": "TERMINATED", "total_run_time": 1602, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.TERMINATED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "681174", "role": "default", "hostname": "smicro01.physik.fu-berlin.de", "state": "TERMINATED", "total_run_time": 1602, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 2, "group_rank": 0, "worker_id": "681175", "role": "default", "hostname": "smicro01.physik.fu-berlin.de", "state": "FAILED", "total_run_time": 1602, "rdzv_backend": "static", "raw_error": "{\"message\": \"\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\", \"local_rank\": [2], \"role_rank\": [2], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.TERMINATED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 3, "group_rank": 0, "worker_id": "681176", "role": "default", "hostname": "smicro01.physik.fu-berlin.de", "state": "TERMINATED", "total_run_time": 1602, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\", \"local_rank\": [3], \"role_rank\": [3], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "smicro01.physik.fu-berlin.de", "state": "SUCCEEDED", "total_run_time": 1602, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\"}", "agent_restarts": 0}}
/home/nikiguo93/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning:


           CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 681175 (local_rank 2) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def trainer_main(args):
# do train


warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/nikiguo93/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 637, in <module>
main()
File "/home/nikiguo93/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 629, in main
run(args)
File "/home/nikiguo93/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 621, in run
elastic_launch(
File "/home/nikiguo93/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/nikiguo93/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/nikiguo93/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


scratch_multitask_Imagenet.py FAILED

Root Cause:
[0]:
time: 2022-06-04_17:20:39
rank: 2 (local_rank: 2)
exitcode: 1 (pid: 681175)
error_file: <N/A>
msg: "Process failed with exitcode 1"

Other Failures:
<NO_OTHER_FAILURES>



Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

The error is caused by:

File "/home/nikiguo93/scratch_multitask_Imagenet.py", line 109, in __getitem__
TheImage = Image.open(self.ImageNames[idx]).convert("RGB")
File "/usr/lib/python3/dist-packages/PIL/Image.py", line 2904, in open
fp = builtins.open(filename, "rb")
IsADirectoryError: [Errno 21] Is a directory: '/local_scratch/nikiguo93/train/hpc-work/Codebase'

as the dataset passes a directory path ('/local_scratch/nikiguo93/train/hpc-work/Codebase') to Image.open, which only accepts files. This also explains the original NCCL timeout: rank 2 dies in its DataLoader worker, the surviving ranks block in the next allreduce waiting for it, and after the 30-minute (1800000 ms) timeout the watchdog takes the whole job down. The DDP run just re-raises this error, so make sure every entry in your dataset's file list points to an actual image file.