Multi-GPU training crashes on A6000

Hi,

I am trying to train DINO with 2 A6000 GPUs. The code works fine when I train on a single GPU but crashes when I use 2 GPUs. My Python version is 3.8.11, my PyTorch version is 1.9.0, and torch.version.cuda is 11.1.
Does anyone have any idea how to debug this error or solve this problem? Thanks in advance!

Command:
python -m torch.distributed.launch --nproc_per_node=2 main_dino.py --arch resnet50 --optimizer sgd --weight_decay 1e-4 --weight_decay_end 1e-4 --global_crops_scale 0.14 1 --local_crops_scale 0.05 0.14 --data_path /home/ma/uname/dataset/imagenet/train --output_dir /temp

Error message:
/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
Please read local_rank from os.environ('LOCAL_RANK') instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : main_dino.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 2
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_q72lm7ip/none_glcjvhtd
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_q72lm7ip/none_glcjvhtd/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_q72lm7ip/none_glcjvhtd/attempt_0/1/error.json
Using cache found in /home/ma/uname/.cache/torch/hub/facebookresearch_xcit_master
Using cache found in /home/ma/uname/.cache/torch/hub/facebookresearch_xcit_master
| distributed init (rank 0): env://
| distributed init (rank 1): env://
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.

[E ProcessGroupNCCL.cpp:566] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807903 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807905 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807903 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807905 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 23796) of binary: /home/ma/uname/maanaconda3/envs/mdetr/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_q72lm7ip/none_glcjvhtd/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_q72lm7ip/none_glcjvhtd/attempt_1/1/error.json
Using cache found in /home/ma/uname/.cache/torch/hub/facebookresearch_xcit_master
Using cache found in /home/ma/uname/.cache/torch/hub/facebookresearch_xcit_master
Traceback (most recent call last):
File "main_dino.py", line 472, in <module>
train_dino(args)
File "main_dino.py", line 134, in train_dino
utils.init_distributed_mode(args)
File "/home/ma/uname/code/dino_orig/utils.py", line 468, in init_distributed_mode
Traceback (most recent call last):
File "main_dino.py", line 472, in <module>
train_dino(args)
File "main_dino.py", line 134, in train_dino
utils.init_distributed_mode(args)
File "/home/ma/uname/code/dino_orig/utils.py", line 468, in init_distributed_mode
dist.init_process_group(
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
dist.init_process_group(
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
_store_based_barrier(rank, store, timeout)
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 23980) of binary: /home/ma/uname/maanaconda3/envs/mdetr/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=2
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_q72lm7ip/none_glcjvhtd/attempt_2/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_q72lm7ip/none_glcjvhtd/attempt_2/1/error.json
Using cache found in /home/ma/uname/.cache/torch/hub/facebookresearch_xcit_master
Using cache found in /home/ma/uname/.cache/torch/hub/facebookresearch_xcit_master
Traceback (most recent call last):
File "main_dino.py", line 472, in <module>
train_dino(args)
File "main_dino.py", line 134, in train_dino
utils.init_distributed_mode(args)
File "/home/ma/uname/code/dino_orig/utils.py", line 468, in init_distributed_mode
dist.init_process_group(
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:30:00)
Traceback (most recent call last):
File "main_dino.py", line 472, in <module>
train_dino(args)
File "main_dino.py", line 134, in train_dino
utils.init_distributed_mode(args)
File "/home/ma/uname/code/dino_orig/utils.py", line 468, in init_distributed_mode
dist.init_process_group(
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 24138) of binary: /home/ma/uname/maanaconda3/envs/mdetr/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=3
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_q72lm7ip/none_glcjvhtd/attempt_3/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_q72lm7ip/none_glcjvhtd/attempt_3/1/error.json
Using cache found in /home/ma/uname/.cache/torch/hub/facebookresearch_xcit_master
Using cache found in /home/ma/uname/.cache/torch/hub/facebookresearch_xcit_master
Traceback (most recent call last):
Traceback (most recent call last):
File "main_dino.py", line 472, in <module>
File "main_dino.py", line 472, in <module>
train_dino(args)
File "main_dino.py", line 134, in train_dino
train_dino(args)
File "main_dino.py", line 134, in train_dino
utils.init_distributed_mode(args)
File "/home/ma/uname/code/dino_orig/utils.py", line 468, in init_distributed_mode
utils.init_distributed_mode(args)
File "/home/ma/uname/code/dino_orig/utils.py", line 468, in init_distributed_mode
dist.init_process_group(
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
dist.init_process_group(
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
_store_based_barrier(rank, store, timeout)
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=8, timeout=0:30:00)
raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=8, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 24711) of binary: /home/ma/uname/maanaconda3/envs/mdetr/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0004601478576660156 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "24711", "role": "default", "hostname": "ma-gpu04", "state": "FAILED", "total_run_time": 7237, "rdzv_backend": "static", "raw_error": "{"message": ""}", "metadata": "{"group_world_size": 1, "entry_point": "python", "local_rank": [0], "role_rank": [0], "role_world_size": [2]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "24712", "role": "default", "hostname": "ma-gpu04", "state": "FAILED", "total_run_time": 7237, "rdzv_backend": "static", "raw_error": "{"message": ""}", "metadata": "{"group_world_size": 1, "entry_point": "python", "local_rank": [1], "role_rank": [1], "role_world_size": [2]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "ma-gpu04", "state": "SUCCEEDED", "total_run_time": 7237, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python"}", "agent_restarts": 3}}
/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning:


           CHILD PROCESS FAILED WITH NO ERROR_FILE

CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 24711 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def trainer_main(args):
    # do train


warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/launch.py", line 173, in <module>
main()
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/launch.py", line 169, in main
run(args)
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/run.py", line 621, in run
elastic_launch(
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/ma/uname/maanaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


      main_dino.py FAILED

=======================================
Root Cause:
[0]:
time: 2021-09-17_13:42:15
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 24711)
error_file: <N/A>
msg: "Process failed with exitcode 1"

Other Failures:
[1]:
time: 2021-09-17_13:42:15
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 24712)
error_file: <N/A>
msg: "Process failed with exitcode 1"


Hi,

If you look closely at the error output, you will see a recommendation on how to gather more information about the root cause:

You can find more information about the record decorator in the PyTorch documentation. If you decorate your main function with @record, you should get a full stack trace of the actual exception.
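
For reference, here is a minimal sketch of how the decorator could be applied in main_dino.py, based on the entrypoint shown in your traceback. The argument parsing below is only a placeholder, not code copied from the DINO repository:

import argparse

from torch.distributed.elastic.multiprocessing.errors import record


@record  # on failure, writes a structured error file so the launcher can surface the real traceback
def train_dino(args):
    # existing training code from main_dino.py goes here
    print(f"training with args: {args}")


if __name__ == "__main__":
    # hypothetical stand-in for the script's own argument parsing
    parser = argparse.ArgumentParser("DINO")
    args = parser.parse_args()
    train_dino(args)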

Having said that, I would also suggest running your command with the --max_restarts=0 option. The default restart-on-failure behavior was a bug in v1.9 that is fixed in v1.9.1; without that option, your training will be restarted up to 3 times before ultimately failing.
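
For example, your launch command with that option added could look like this (note that --max_restarts is a launcher option, so it must come before main_dino.py; the script arguments are unchanged from your post):

python -m torch.distributed.launch --nproc_per_node=2 --max_restarts=0 main_dino.py --arch resnet50 --optimizer sgd --weight_decay 1e-4 --weight_decay_end 1e-4 --global_crops_scale 0.14 1 --local_crops_scale 0.05 0.14 --data_path /home/ma/uname/dataset/imagenet/train --output_dir /temp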
