I am working with an existing project: https://github.com/microsoft/TAP
The project is about text-based visual question answering (TextVQA).
My system has multiple GPUs, and CUDA and the GPUs are available to the project (01:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)).
My operating system is Ubuntu 22.04.
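For reference, a quick check like the one below can confirm whether PyTorch actually sees the GPUs (this one-liner is my own, not from the TAP guide):

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"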
I am following the project's commands step by step. First, I downloaded the project, installed the packages according to the instructions in the project guide, downloaded the datasets, and put them in the data folder.
In the second step, I start training, but at this point I run into a problem with distributed PyTorch and I don't know how to solve it. (I don't know whether the problem is caused by insufficient resources on my machine, or whether I can change the code to make it run.) I think the error comes from torch.distributed.launch.
The command is:
python -m torch.distributed.launch --nproc_per_node 4 tools/run.py --pretrain --tasks vqa --datasets m4c_textvqa --model m4c_split --seed 13 --config configs/vqa/m4c_textvqa/tap_base_pretrain.yml --save_dir save/m4c_split_pretrain_test training_parameters.distributed True
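As a side note, the first warning in the log below says torch.distributed.launch is deprecated in favor of torchrun. I believe the equivalent torchrun command would look like this (a sketch; I have not verified it, and because torchrun sets --use_env, tools/run.py would also need to read the local rank from the environment instead of a --local_rank argument):

torchrun --nproc_per_node 4 tools/run.py --pretrain --tasks vqa --datasets m4c_textvqa --model m4c_split --seed 13 --config configs/vqa/m4c_textvqa/tap_base_pretrain.yml --save_dir save/m4c_split_pretrain_test training_parameters.distributed True

with a change along these lines in the script:

import os
local_rank = int(os.environ["LOCAL_RANK"])  # torchrun exports LOCAL_RANK instead of passing --local_rank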
I have attached the error below:
(TAP) riv@riv-System-Product-Name:/media/riv/New Volume/kf/TAP$ python -m torch.distributed.launch --nproc_per_node 4 tools/run.py --pretrain --tasks vqa --datasets m4c_textvqa --model m4c_split --seed 13 --config configs/vqa/m4c_textvqa/tap_base_pretrain.yml --save_dir save/m4c_split_pretrain_test training_parameters.distributed True
/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Overriding option training_parameters.distributed to True
You have chosen to seed the training. This will turn on CUDNN deterministic setting which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
Overriding option training_parameters.distributed to True
You have chosen to seed the training. This will turn on CUDNN deterministic setting which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
Overriding option training_parameters.distributed to True
You have chosen to seed the training. This will turn on CUDNN deterministic setting which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
Overriding option training_parameters.distributed to True
You have chosen to seed the training. This will turn on CUDNN deterministic setting which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
Traceback (most recent call last):
File "tools/run.py", line 92, in <module>
run()
File "tools/run.py", line 80, in run
trainer.load()
File "/media/riv/New Volume/kf/TAP/pythia/trainers/base_trainer.py", line 33, in load
self._init_process_group()
File "/media/riv/New Volume/kf/TAP/pythia/trainers/base_trainer.py", line 63, in _init_process_group
synchronize()
File "/media/riv/New Volume/kf/TAP/pythia/utils/distributed_utils.py", line 18, in synchronize
dist.barrier()
File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1640811805959/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3070518) of binary: /home/riv/anaconda3/envs/TAP/bin/python
Traceback (most recent call last):
File "/home/riv/anaconda3/envs/TAP/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/riv/anaconda3/envs/TAP/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/riv/anaconda3/envs/TAP/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tools/run.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-09-11_21:08:18
host : riv-System-Product-Name
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3070518)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
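To isolate whether the dist.barrier() failure comes from TAP or from my setup, I think a minimal standalone test like the following would help (my own sketch, not part of TAP; the file name minimal_dist_test.py is made up):

# minimal_dist_test.py
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # passed by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)       # bind this process to its own GPU
dist.init_process_group(backend="nccl")      # reads MASTER_ADDR/RANK/WORLD_SIZE from the env set by the launcher
dist.barrier()                               # the same collective that fails inside TAP's synchronize()
print(f"rank {dist.get_rank()} of {dist.get_world_size()} passed the barrier")

launched with the same launcher:

python -m torch.distributed.launch --nproc_per_node 4 minimal_dist_test.py

If this also fails with --nproc_per_node 4 but passes when --nproc_per_node matches the number of GPUs that nvidia-smi reports, I would conclude the error comes from launching more processes than available GPUs (so several ranks share one device, which NCCL rejects) rather than from TAP's code.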