Hello,
I would like to run torch.distributed on an HPC cluster. The command I'm using is the following:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node 2 train.py
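(The deprecation warning in the log below says torch.distributed.launch is deprecated in favor of torchrun; as far as I understand, the equivalent invocation would be something like

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2 train.py

but I don't know whether that is related to the error.)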
I'm using two NVIDIA Quadro RTX 6000 GPUs with 24 GB of memory each. train.py is a Python script that uses the Hugging Face Trainer to fine-tune a transformer model.
I'm getting the error shown below. Does anyone know how this can be solved?
/cluster/apps/nss/gcc-8.2.0/python/3.10.4/x86_64/lib64/python3.10/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
Traceback (most recent call last):
File "/cluster/home/username/chatbot/gpt_j/train.py", line 294, in <module>
main(sys.argv[1:])
File "/cluster/home/username/chatbot/gpt_j/train.py", line 64, in main
model = Model()
File "/cluster/home/username/chatbot/gpt_j/model.py", line 43, in __init__
self.model.to(self.device)
File "/cluster/apps/nss/gcc-8.2.0/python/3.10.4/x86_64/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 907, in to
return self._apply(convert)
File "/cluster/apps/nss/gcc-8.2.0/python/3.10.4/x86_64/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 578, in _apply
module._apply(fn)
File "/cluster/apps/nss/gcc-8.2.0/python/3.10.4/x86_64/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 578, in _apply
module._apply(fn)
File "/cluster/apps/nss/gcc-8.2.0/python/3.10.4/x86_64/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 578, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "/cluster/apps/nss/gcc-8.2.0/python/3.10.4/x86_64/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 601, in _apply
param_applied = fn(param)
File "/cluster/apps/nss/gcc-8.2.0/python/3.10.4/x86_64/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 905, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2022-08-20 02:34:24,834 WARNING:Using custom data configuration default-990e072ab094d8c6
Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
/cluster/home/username/.local/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x200af06) [0x2ae499958f06]
/cluster/home/username/.local/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x20028e5) [0x2ae4999508e5]
/cluster/home/username/.local/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x1f27e09) [0x2ae499875e09]
/cluster/home/username/.local/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x2ae499959a3d]
/cluster/home/username/.local/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x1f25948) [0x2ae499873948]
/cluster/home/username/.local/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x2ae499959a3d]
/cluster/home/username/.local/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x1ee0b46) [0x2ae49982eb46]
/cluster/home/username/.local/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x194546a) [0x2ae49929346a]
/lib64/libc.so.6(+0x39ce9) [0x2ae49001fce9]
/lib64/libc.so.6(+0x39d37) [0x2ae49001fd37]
/lib64/libc.so.6(__libc_start_main+0xfc) [0x2ae49000855c]
/cluster/apps/nss/gcc-8.2.0/python/3.10.4/x86_64/bin/python() [0x4006fe]
2022-08-20 02:34:24,947 WARNING:Reusing dataset text (/cluster/home/username/.cache/huggingface/datasets/text/default-990e072ab094d8c6/0.0.0/21a506d1b2b34316b1e82d0bd79066905d846e5d7e619823c0dd338d6f1fa6ad)
0%| | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 3.57it/s]
100%|██████████| 1/1 [00:00<00:00, 3.56it/s]
2022-08-20 02:34:25,663 WARNING:Using custom data configuration default-e89076d74da83269
2022-08-20 02:34:25,669 WARNING:Reusing dataset text (/cluster/home/username/.cache/huggingface/datasets/text/default-e89076d74da83269/0.0.0/21a506d1b2b34316b1e82d0bd79066905d846e5d7e619823c0dd338d6f1fa6ad)
0%| | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 9.74it/s]
100%|██████████| 1/1 [00:00<00:00, 9.71it/s]
DatasetDict({
train: Dataset({
features: ['text'],
num_rows: 787650
})
validation: Dataset({
features: ['text'],
num_rows: 262548
})
})
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 123732 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 123731) of binary: /cluster/apps/nss/gcc-8.2.0/python/3.10.4/x86_64/bin/python
/cluster/shadow/.lsbatch/1660955521.229195199: line 8: 123724 Segmentation fault (core dumped) CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node 2 train.py
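The deprecation warning at the top also says that torchrun sets --use_env and that the script should read the local rank from os.environ['LOCAL_RANK'] instead of a --local_rank argument. In case it matters, I assume the intended pattern for placing the model on the right GPU per process is roughly the following (a minimal sketch only; Model, self.device, and the GPT-J checkpoint name are placeholders for my actual code, which is larger):

import os
import torch
from transformers import AutoModelForCausalLM

class Model:
    def __init__(self):
        # torchrun / --use_env exports LOCAL_RANK for each worker process
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        self.device = torch.device(f"cuda:{local_rank}")
        # placeholder checkpoint; my real model is a GPT-J variant
        self.model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
        # this is the call that raises the "all CUDA-capable devices are busy
        # or unavailable" error in my traceback (model.py, line 43)
        self.model.to(self.device)

I'm not sure whether the device handling in my code is the problem or whether the GPUs are simply not available to the job.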