I encounter the following error when I use torchtune to train a model:

CUDA_VISIBLE_DEVICES=4,5,6,7 tune run --nproc_per_node 4 lora_finetune_distributed --config llama3_1/8B_lora.yaml

The processes exit at work = default_pg.broadcast([tensor], opts) (torch/distributed/distributed_c10d.py, line 2417), with no other error messages. The call chain is: training.load_from_full_model_state_dict, invoked at line 503 of lora_finetune_distributed.py, calls distribute_tensor at line 338 of torchtune/training/_distributed.py, which constructs random._rng_tracker = OffsetBasedRNGTracker(device_type) at line 685 of torch/distributed/tensor/_api.py, and the broadcast inside that constructor is where the processes die.
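To help isolate the problem, here is a minimal standalone script I put together (my own sketch, not torchtune code; repro_distribute_tensor.py is just a placeholder name) that exercises the same distribute_tensor path described above. Running it on the same four GPUs should show whether the segfault also occurs in plain DTensor code outside the recipe:

# repro_distribute_tensor.py -- hypothetical minimal repro, not part of torchtune.
# Run with:
#   CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --nproc_per_node 4 repro_distribute_tensor.py
import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Replicate

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # One-dimensional mesh over all four ranks.
    mesh = init_device_mesh("cuda", (dist.get_world_size(),))
    full = torch.randn(16, 16, dtype=torch.bfloat16, device="cuda")

    # According to the traceback above, distribute_tensor lazily creates the
    # OffsetBasedRNGTracker, whose constructor issues the broadcast that segfaults.
    dtensor = distribute_tensor(full, device_mesh=mesh, placements=[Replicate()])
    print(f"rank {dist.get_rank()}: ok, local shape {dtensor.to_local().shape}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()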
Here are the detailed outputs:
[I1022 17:07:37.797161243 debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I1022 17:07:39.836860510 debug.cpp:49] [c10d] The debug level is set to DETAIL.
Running with torchrun...
Namespace(func=<bound method Run._run_cmd of <torchtune._cli.run.Run object at 0x7fe27ff10940>>, nnodes='1:1', nproc_per_node='4', rdzv_backend='static', rdzv_endpoint='', rdzv_id='none', rdzv_conf='', standalone=False, max_restarts=0, monitor_interval=0.1, start_method='spawn', role='default', module=False, no_python=False, run_path=False, log_dir=None, redirects='0', tee='0', local_ranks_filter='', node_rank=0, master_addr='127.0.0.1', master_port=29500, local_addr=None, logs_specs=None, recipe='/data/user/shared/torchtune/recipes/lora_finetune_distributed.py', recipe_args=['--config', '/home/user/shared/torchtune/recipes/configs/llama3_1/8B_lora.yaml'], training_script='/data/user/shared/torchtune/recipes/lora_finetune_distributed.py', training_script_args=['--config', '/home/user/shared/torchtune/recipes/configs/llama3_1/8B_lora.yaml'])
W1022 17:07:40.830000 2790913 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py:793]
W1022 17:07:40.830000 2790913 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W1022 17:07:40.830000 2790913 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1022 17:07:40.830000 2790913 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
[I1022 17:07:40.696906459 TCPStore.cpp:298] [c10d - debug] The server has started on port = 29500.
[I1022 17:07:40.696954905 TCPStoreLibUvBackend.cpp:1100] [c10d - debug] Uv main loop running
[I1022 17:07:40.697016370 socket.cpp:773] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500).
[I1022 17:07:40.697062480 socket.cpp:847] [c10d - trace] The client socket is attempting to connect to [::ffff:127.0.0.1]:29500.
[I1022 17:07:40.697840892 socket.cpp:938] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on SocketImpl(fd=72, addr=[::ffff:127.0.0.1]:45702, remote=[::ffff:127.0.0.1]:29500).
[I1022 17:07:40.698006414 TCPStore.cpp:334] [c10d - debug] TCP client connected to host 127.0.0.1:29500
[I1022 17:07:41.356534965 debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I1022 17:07:41.356535002 debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I1022 17:07:41.358519543 debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I1022 17:07:41.360214939 debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I1022 17:07:43.380146172 debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I1022 17:07:43.413592241 debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I1022 17:07:43.425319630 debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I1022 17:07:43.428458372 debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I1022 17:07:44.217347549 socket.cpp:773] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500).
[I1022 17:07:44.217418420 socket.cpp:847] [c10d - trace] The client socket is attempting to connect to [::ffff:127.0.0.1]:29500.
[I1022 17:07:44.218081149 socket.cpp:938] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on SocketImpl(fd=50, addr=[::ffff:127.0.0.1]:35816, remote=[::ffff:127.0.0.1]:29500).
[I1022 17:07:44.218207459 TCPStore.cpp:334] [c10d - debug] TCP client connected to host 127.0.0.1:29500
[I1022 17:07:44.218792363 ProcessGroupNCCL.cpp:905] [PG ID 0 PG GUID 0 Rank 0] ProcessGroupNCCL initialization options: size: 4, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
[I1022 17:07:44.218819744 ProcessGroupNCCL.cpp:914] [PG ID 0 PG GUID 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.21.5, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 480, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0, TORCH_NCCL_CUDA_EVENT_CACHE: 0, TORCH_NCCL_LOG_CPP_STACK_ON_UNCLEAN_SHUTDOWN: 1
[I1022 17:07:44.320745613 socket.cpp:773] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500).
[I1022 17:07:44.320819673 socket.cpp:773] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500).
[I1022 17:07:44.320876774 socket.cpp:847] [c10d - trace] The client socket is attempting to connect to [::ffff:127.0.0.1]:29500.
[I1022 17:07:44.320818546 socket.cpp:847] [c10d - trace] The client socket is attempting to connect to [::ffff:127.0.0.1]:29500.
[I1022 17:07:44.321534202 socket.cpp:938] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on SocketImpl(fd=50, addr=[::ffff:127.0.0.1]:35818, remote=[::ffff:127.0.0.1]:29500).
[I1022 17:07:44.321666135 TCPStore.cpp:334] [c10d - debug] TCP client connected to host 127.0.0.1:29500
[I1022 17:07:44.321551946 socket.cpp:938] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on SocketImpl(fd=50, addr=[::ffff:127.0.0.1]:35820, remote=[::ffff:127.0.0.1]:29500).
[I1022 17:07:44.321683112 TCPStore.cpp:334] [c10d - debug] TCP client connected to host 127.0.0.1:29500
[I1022 17:07:44.322037997 ProcessGroupNCCL.cpp:905] [PG ID 0 PG GUID 0 Rank 2] ProcessGroupNCCL initialization options: size: 4, global rank: 2, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
[I1022 17:07:44.322040044 ProcessGroupNCCL.cpp:905] [PG ID 0 PG GUID 0 Rank 3] ProcessGroupNCCL initialization options: size: 4, global rank: 3, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
[I1022 17:07:44.322058372 ProcessGroupNCCL.cpp:914] [PG ID 0 PG GUID 0 Rank 2] ProcessGroupNCCL environments: NCCL version: 2.21.5, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 480, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0, TORCH_NCCL_CUDA_EVENT_CACHE: 0, TORCH_NCCL_LOG_CPP_STACK_ON_UNCLEAN_SHUTDOWN: 1
[I1022 17:07:44.322060888 ProcessGroupNCCL.cpp:914] [PG ID 0 PG GUID 0 Rank 3] ProcessGroupNCCL environments: NCCL version: 2.21.5, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 480, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0, TORCH_NCCL_CUDA_EVENT_CACHE: 0, TORCH_NCCL_LOG_CPP_STACK_ON_UNCLEAN_SHUTDOWN: 1
[I1022 17:07:44.334103873 socket.cpp:773] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500).
[I1022 17:07:44.334160955 socket.cpp:847] [c10d - trace] The client socket is attempting to connect to [::ffff:127.0.0.1]:29500.
[I1022 17:07:44.334705101 socket.cpp:938] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on SocketImpl(fd=50, addr=[::ffff:127.0.0.1]:35822, remote=[::ffff:127.0.0.1]:29500).
[I1022 17:07:44.334847697 TCPStore.cpp:334] [c10d - debug] TCP client connected to host 127.0.0.1:29500
[I1022 17:07:44.335158871 ProcessGroupNCCL.cpp:905] [PG ID 0 PG GUID 0 Rank 1] ProcessGroupNCCL initialization options: size: 4, global rank: 1, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
[I1022 17:07:44.335178299 ProcessGroupNCCL.cpp:914] [PG ID 0 PG GUID 0 Rank 1] ProcessGroupNCCL environments: NCCL version: 2.21.5, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 480, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0, TORCH_NCCL_CUDA_EVENT_CACHE: 0, TORCH_NCCL_LOG_CPP_STACK_ON_UNCLEAN_SHUTDOWN: 1
INFO:torchtune.utils._logging:Running LoRAFinetuneRecipeDistributed with resolved config:
batch_size: 2
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /home/user/projects/llm-train/model/llama3-8b
  checkpoint_files:
  - model-00001-of-00004.safetensors
  - model-00002-of-00004.safetensors
  - model-00003-of-00004.safetensors
  - model-00004-of-00004.safetensors
  model_type: LLAMA3
  output_dir: outputs//llama3-8b-test
  recipe_checkpoint: null
compile: false
dataset:
  _component_: torchtune.datasets.SFTDataset
  source:
    name: /home/user/projects/llm-train/data_preparation/new/processed_data/1010-rm-act-v8/postfiltered_sep.json
device: cuda
dtype: bf16
enable_activation_checkpointing: false
epochs: 2
gradient_accumulation_steps: 4
log_every_n_steps: 1
log_peak_memory_stats: false
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 10
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: outputs//llama3-8b-test
model:
  _component_: torchtune.models.llama3_1.lora_llama3_1_8b
  apply_lora_to_mlp: true
  apply_lora_to_output: false
  lora_alpha: 16
  lora_attn_modules:
  - q_proj
  - v_proj
  lora_dropout: 0.0
  lora_rank: 8
optimizer:
  _component_: torch.optim.AdamW
  fused: true
  lr: 0.0003
  weight_decay: 0.01
output_dir: outputs//llama3-8b-test
resume_from_checkpoint: false
seed: 42
shuffle: true
tokenizer:
  _component_: torchtune.modules.tokenizers.HFTokenizer
  max_seq_len: null
  path: /home/user/projects/llm-train/model/llama3-8b
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 42. Local seed is seed + rank = 42 + 0
Writing logs to outputs/llama3-8b-test/log_1729588066.txt
INFO:torchtune.utils._logging:FSDP is enabled. Instantiating model and loading checkpoint on Rank 0 ...
[rank2]:[I1022 17:07:47.551950116 ProcessGroupWrapper.cpp:587] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=BROADCAST, TensorShape=[16], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=c10::BFloat16 (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank3]:[I1022 17:07:47.558312214 ProcessGroupWrapper.cpp:587] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=BROADCAST, TensorShape=[16], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=c10::BFloat16 (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank1]:[I1022 17:07:47.559710289 ProcessGroupWrapper.cpp:587] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=BROADCAST, TensorShape=[16], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=c10::BFloat16 (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank0]:[I1022 17:07:47.658896824 ProcessGroupWrapper.cpp:587] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=BROADCAST, TensorShape=[16], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=c10::BFloat16 (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank0]:[I1022 17:07:47.661509864 ProcessGroupNCCL.cpp:2262] [PG ID 0 PG GUID 0 Rank 0] ProcessGroupNCCL broadcast unique ID through store took 0.033109 ms
[rank2]:[I1022 17:07:47.661602191 ProcessGroupNCCL.cpp:2262] [PG ID 0 PG GUID 0 Rank 2] ProcessGroupNCCL broadcast unique ID through store took 1.32689 ms
[rank3]:[I1022 17:07:47.661607930 ProcessGroupNCCL.cpp:2262] [PG ID 0 PG GUID 0 Rank 3] ProcessGroupNCCL broadcast unique ID through store took 1.32036 ms
[rank1]:[I1022 17:07:47.661625472 ProcessGroupNCCL.cpp:2262] [PG ID 0 PG GUID 0 Rank 1] ProcessGroupNCCL broadcast unique ID through store took 1.3291 ms
NCCL version 2.21.5+cuda12.4
[I1022 17:07:48.778291492 TCPStoreLibUvBackend.cpp:119] [c10d - debug] Read callback failed. code:-4095 name:EOF desc:end of file
[I1022 17:07:48.804306340 TCPStoreLibUvBackend.cpp:119] [c10d - debug] Read callback failed. code:-4095 name:EOF desc:end of file
[I1022 17:07:48.829952539 TCPStoreLibUvBackend.cpp:119] [c10d - debug] Read callback failed. code:-4095 name:EOF desc:end of file
W1022 17:07:48.975000 2790913 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2791076 closing signal SIGTERM
W1022 17:07:48.979000 2790913 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2791078 closing signal SIGTERM
[I1022 17:07:48.857831227 TCPStoreLibUvBackend.cpp:119] [c10d - debug] Read callback failed. code:-4095 name:EOF desc:end of file
E1022 17:07:49.044000 2790913 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -11) local_rank: 0 (pid: 2791075) of binary: /home/user/miniconda3/envs/torchtune/bin/python
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/torchtune/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/data/user/shared/torchtune/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/data/user/shared/torchtune/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/data/user/shared/torchtune/torchtune/_cli/run.py", line 207, in _run_cmd
    self._run_distributed(args, is_builtin=is_builtin)
  File "/home/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/data/user/shared/torchtune/torchtune/_cli/run.py", line 96, in _run_distributed
    run(args)
  File "/home/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/data/user/shared/torchtune/recipes/lora_finetune_distributed.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time : 2024-10-22_17:07:48
  host : 10-7-133-248
  rank : 2 (local_rank: 2)
  exitcode : -11 (pid: 2791077)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 2791077
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-10-22_17:07:48
  host : 10-7-133-248
  rank : 0 (local_rank: 0)
  exitcode : -11 (pid: 2791075)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 2791075
============================================================
[I1022 17:07:49.742841786 TCPStoreLibUvBackend.cpp:119] [c10d - debug] Read callback failed. code:-4095 name:EOF desc:end of file
[I1022 17:07:49.742932318 TCPStoreLibUvBackend.cpp:1033] [c10d - debug] Store exit requested
[I1022 17:07:49.742943128 TCPStoreLibUvBackend.cpp:1103] [c10d - debug] UV main loop done: res:1
[I1022 17:07:49.742959887 TCPStoreLibUvBackend.cpp:1109] [c10d - debug] Walking live handles prior to closing clients
[I1022 17:07:49.742965980 TCPStoreLibUvBackend.cpp:1090] [c10d - debug] UV live handle type 12 active:1 is-closing:0
[I1022 17:07:49.742975905 TCPStoreLibUvBackend.cpp:1119] [c10d - debug] Walking live handles after closing clients
[I1022 17:07:49.742982589 TCPStoreLibUvBackend.cpp:1090] [c10d - debug] UV live handle type 12 active:0 is-closing:1
[I1022 17:07:49.742985997 TCPStoreLibUvBackend.cpp:1128] [c10d] uv_loop_close failed with:-16 errn:EBUSY desc:resource busy or locked
[I1022 17:07:49.743006792 TCPStoreLibUvBackend.cpp:1138] [c10d] uv_loop cleanup finished.
My CUDA version is 12.1, PyTorch is 2.5.0, and torchtune is built from the latest git repo.
The program runs successfully when using the lora_finetune_single_device recipe.
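Since the crash happens inside default_pg.broadcast, I also have a bare NCCL broadcast check (again my own sketch, not torchtune code; check_broadcast.py is a placeholder name) that I can run on the same four GPUs to rule out a problem with NCCL itself on this machine:

# check_broadcast.py -- hypothetical NCCL sanity check, not part of torchtune.
# Run with:
#   CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --nproc_per_node 4 check_broadcast.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Broadcast a small tensor from rank 0, loosely mirroring the failing call
    # (the logs above show a broadcast of a 16-element Byte tensor).
    t = torch.zeros(16, dtype=torch.uint8, device="cuda")
    if dist.get_rank() == 0:
        t.fill_(1)
    dist.broadcast(t, src=0)
    torch.cuda.synchronize()
    print(f"rank {dist.get_rank()}: broadcast ok, sum={int(t.sum())}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()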