Error when running an existing project with PyTorch

I am working with an existing project, https://github.com/microsoft/TAP
The project is about text-based visual question answering.
My system has multiple GPUs, and CUDA and the GPU are available to the project (01:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)).
My operating system is Ubuntu 22.08.
I am following the commands in the project step by step.
I downloaded the project, installed the packages according to the instructions in the project guide, downloaded the datasets, and put them in the data folder.
In the second step I start training, but here I run into a problem with distributed PyTorch and I don't know how to solve it (I don't know whether the problem is due to a lack of resources on my system or whether I can change the code and run it). I think the error comes from torch.distributed.launch.
The command is:
python -m torch.distributed.launch --nproc_per_node 4 tools/run.py --pretrain --tasks vqa --datasets m4c_textvqa --model m4c_split --seed 13 --config configs/vqa/m4c_textvqa/tap_base_pretrain.yml --save_dir save/m4c_split_pretrain_test training_parameters.distributed True
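
The warning in the output below says torch.distributed.launch is deprecated in favor of torchrun; assuming the script were changed to read the local rank from os.environ['LOCAL_RANK'] instead of a --local_rank argument, the equivalent torchrun launch would presumably be:

torchrun --nproc_per_node 4 tools/run.py --pretrain --tasks vqa --datasets m4c_textvqa --model m4c_split --seed 13 --config configs/vqa/m4c_textvqa/tap_base_pretrain.yml --save_dir save/m4c_split_pretrain_test training_parameters.distributed True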

I have attached the error:

(TAP) riv@riv-System-Product-Name:/media/riv/New Volume/kf/TAP$ python -m torch.distributed.launch --nproc_per_node 4 tools/run.py --pretrain --tasks vqa --datasets m4c_textvqa --model m4c_split --seed 13 --config configs/vqa/m4c_textvqa/tap_base_pretrain.yml --save_dir save/m4c_split_pretrain_test training_parameters.distributed True
/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Overriding option training_parameters.distributed to True
You have chosen to seed the training. This will turn on CUDNN deterministic setting which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
Overriding option training_parameters.distributed to True
You have chosen to seed the training. This will turn on CUDNN deterministic setting which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
Overriding option training_parameters.distributed to True
You have chosen to seed the training. This will turn on CUDNN deterministic setting which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
Overriding option training_parameters.distributed to True
You have chosen to seed the training. This will turn on CUDNN deterministic setting which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
Traceback (most recent call last):
  File "tools/run.py", line 92, in <module>
    run()
  File "tools/run.py", line 80, in run
    trainer.load()
  File "/media/riv/New Volume/kf/TAP/pythia/trainers/base_trainer.py", line 33, in load
    self._init_process_group()
  File "/media/riv/New Volume/kf/TAP/pythia/trainers/base_trainer.py", line 63, in _init_process_group
    synchronize()
  File "/media/riv/New Volume/kf/TAP/pythia/utils/distributed_utils.py", line 18, in synchronize
    dist.barrier()
  File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1640811805959/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3070518) of binary: /home/riv/anaconda3/envs/TAP/bin/python
Traceback (most recent call last):
  File "/home/riv/anaconda3/envs/TAP/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/riv/anaconda3/envs/TAP/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools/run.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-09-11_21:08:18
  host      : riv-System-Product-Name
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3070518)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Could you rerun your code with NCCL_DEBUG=INFO and post the output here, please?

Thank you so much for the reply, but I'm not sure I understand what you mean. Do you mean I should run this?
(TAP) riv@riv-System-Product-Name:/media/riv/New Volume/kf/TAP$ NCCL_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node 4 tools/run.py --pretrain --tasks vqa --datasets m4c_textvqa --model m4c_split --seed 13 --config configs/vqa/m4c_textvqa/tap_base_pretrain.yml --save_dir save/m4c_split_pretrain_test training_parameters.distributed True

If that command is correct, the output is…

/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Overriding option training_parameters.distributed to True
You have chosen to seed the training. This will turn on CUDNN deterministic setting which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
Overriding option training_parameters.distributed to True
You have chosen to seed the training. This will turn on CUDNN deterministic setting which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
Overriding option training_parameters.distributed to True
You have chosen to seed the training. This will turn on CUDNN deterministic setting which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
Overriding option training_parameters.distributed to True
You have chosen to seed the training. This will turn on CUDNN deterministic setting which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
riv-System-Product-Name:3117279:3117279 [0] NCCL INFO Bootstrap : Using enp0s31f6:192.168.80.11<0>
riv-System-Product-Name:3117279:3117279 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

riv-System-Product-Name:3117279:3117279 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
riv-System-Product-Name:3117279:3117279 [0] NCCL INFO NET/Socket : Using [0]enp0s31f6:192.168.80.11<0> [1]ppp0:172.22.3.236<0>
riv-System-Product-Name:3117279:3117279 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
riv-System-Product-Name:3117281:3117281 [0] NCCL INFO Bootstrap : Using enp0s31f6:192.168.80.11<0>
riv-System-Product-Name:3117281:3117281 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

riv-System-Product-Name:3117281:3117281 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
riv-System-Product-Name:3117281:3117281 [0] NCCL INFO NET/Socket : Using [0]enp0s31f6:192.168.80.11<0> [1]ppp0:172.22.3.236<0>
riv-System-Product-Name:3117281:3117281 [0] NCCL INFO Using network Socket
riv-System-Product-Name:3117280:3117280 [0] NCCL INFO Bootstrap : Using enp0s31f6:192.168.80.11<0>
riv-System-Product-Name:3117280:3117280 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

riv-System-Product-Name:3117280:3117280 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
riv-System-Product-Name:3117280:3117280 [0] NCCL INFO NET/Socket : Using [0]enp0s31f6:192.168.80.11<0> [1]ppp0:172.22.3.236<0>
riv-System-Product-Name:3117280:3117280 [0] NCCL INFO Using network Socket
riv-System-Product-Name:3117282:3117282 [0] NCCL INFO Bootstrap : Using enp0s31f6:192.168.80.11<0>
riv-System-Product-Name:3117282:3117282 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

riv-System-Product-Name:3117282:3117282 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
riv-System-Product-Name:3117282:3117282 [0] NCCL INFO NET/Socket : Using [0]enp0s31f6:192.168.80.11<0> [1]ppp0:172.22.3.236<0>
riv-System-Product-Name:3117282:3117282 [0] NCCL INFO Using network Socket

riv-System-Product-Name:3117281:3117334 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 2 and rank 0 both on CUDA device 1000
riv-System-Product-Name:3117281:3117334 [0] NCCL INFO init.cc:904 -> 5

riv-System-Product-Name:3117282:3117336 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 3 and rank 0 both on CUDA device 1000

riv-System-Product-Name:3117279:3117333 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1000
riv-System-Product-Name:3117282:3117336 [0] NCCL INFO init.cc:904 -> 5
riv-System-Product-Name:3117279:3117333 [0] NCCL INFO init.cc:904 -> 5
riv-System-Product-Name:3117281:3117334 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
riv-System-Product-Name:3117282:3117336 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
riv-System-Product-Name:3117279:3117333 [0] NCCL INFO group.cc:72 -> 5 [Async thread]

riv-System-Product-Name:3117280:3117335 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1000
riv-System-Product-Name:3117280:3117335 [0] NCCL INFO init.cc:904 -> 5
riv-System-Product-Name:3117280:3117335 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
Traceback (most recent call last):
  File "tools/run.py", line 92, in <module>
    run()
  File "tools/run.py", line 80, in run
    trainer.load()
  File "/media/riv/New Volume/kf/TAP/pythia/trainers/base_trainer.py", line 33, in load
    self._init_process_group()
  File "/media/riv/New Volume/kf/TAP/pythia/trainers/base_trainer.py", line 63, in _init_process_group
    synchronize()
  File "/media/riv/New Volume/kf/TAP/pythia/utils/distributed_utils.py", line 18, in synchronize
    dist.barrier()
  File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1640811805959/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3117279) of binary: /home/riv/anaconda3/envs/TAP/bin/python
Traceback (most recent call last):
  File "/home/riv/anaconda3/envs/TAP/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/riv/anaconda3/envs/TAP/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools/run.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-09-12_01:31:00
  host      : riv-System-Product-Name
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3117279)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Thanks for the update. It seems your DDP script tries to reuse the same device:

init.cc:521 NCCL WARN Duplicate GPU detected : rank 2 and rank 0 both on CUDA device 1000

and is thus crashing.
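
A quick way to check what each process actually sees is to print the device visibility and rank information right before the process group is created. This is only a diagnostic sketch (where exactly you add it in base_trainer.py is up to you), assuming the launcher sets LOCAL_RANK or passes --local_rank:

import os
import torch

# What this process can see and which local rank it was assigned by the launcher.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))
print("visible GPU count:", torch.cuda.device_count())
print("LOCAL_RANK:", os.environ.get("LOCAL_RANK", "<not set>"))
print("current device:", torch.cuda.current_device())

If every rank reports the same current device (typically 0), they are all ending up on one GPU, which is exactly what the "Duplicate GPU detected" warning is complaining about.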

So how can I solve it? What do you suggest? Could you give me code to rerun? I am using a computer with 8 GeForce GTX 1080 Ti GPUs, and those GPUs are available to the project… Does it mean I should use multiple systems together?

I would recommend first running the DDP tutorial and making sure it can be executed properly. Afterwards, check your script to see whether the other ranks are using the default device (GPU0) somewhere in their code. Adding torch.cuda.set_device in your script might help.
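
A minimal sketch of what that could look like early in the training entry point, before any collective call (where exactly it belongs in tools/run.py or the trainer is an assumption on my side):

import os
import torch
import torch.distributed as dist

# Bind each process to its own GPU so ranks 1..N-1 don't fall back to GPU 0.
local_rank = int(os.environ["LOCAL_RANK"])  # or args.local_rank when using torch.distributed.launch
torch.cuda.set_device(local_rank)

# NCCL process group; rank and world size come from the env vars set by the launcher.
dist.init_process_group(backend="nccl")
print(f"rank {dist.get_rank()}/{dist.get_world_size()} running on GPU {torch.cuda.current_device()}")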


Hi, thanks for taking the time and for these useful tips. I am sorry for the late reply; I was checking my computer and the source code. I realized that my code is OK, but my computer has only one GPU (GeForce GTX 1080 Ti) and one CPU. Could that be the cause of these errors? Will I still be able to run this program (run with NCCL and get the rank and world size) with only one GPU? If yes, could you give me some suggestions?

It wouldn't make sense to run a data-parallel use case with a single GPU. NCCL might be able to return the single rank etc., but it won't be usable afterwards. As you can see from your original error message, it raises an error if you try to initialize different ranks on the same device.
Remove the DDP usage in this case and use the single GPU only on this server.
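
If it helps, a single-GPU run of the same pretraining step would drop the launcher and turn off the distributed flag, roughly like below (this is a guess based on the flags above, so please double-check against the TAP README):

python tools/run.py --pretrain --tasks vqa --datasets m4c_textvqa --model m4c_split --seed 13 --config configs/vqa/m4c_textvqa/tap_base_pretrain.yml --save_dir save/m4c_split_pretrain_test training_parameters.distributed False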