Training process exits with code -11 when broadcasting a tensor

I hit the following error when using torchtune to train a model with the command:

CUDA_VISIBLE_DEVICES=4,5,6,7 tune run --nproc_per_node 4 lora_finetune_distributed --config llama3_1/8B_lora.yaml

The process exits at work = default_pg.broadcast([tensor], opts) (line 2417 of torch.distributed.distributed_c10d.py), which is reached from random._rng_tracker = OffsetBasedRNGTracker(device_type) (line 685 of torch.distributed.tensor._api.py), called by distribute_tensor (line 338 of torchtune/training/_distributed.py) inside training.load_from_full_model_state_dict (line 503 of lora_finetune_distributed.py), without any other error messages.
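For reference, the crashing collective is just a broadcast of a small CUDA byte tensor from rank 0 (the CollectiveFingerPrint lines in the log below show OpType=BROADCAST, TensorShape=[16], Byte). A minimal standalone sketch of that call path, assuming a 4-rank NCCL group launched with torchrun (the file name is only illustrative, and this is not the actual torchtune code), would be:

# torchrun --standalone --nproc_per_node=4 broadcast_repro.py  (hypothetical repro, not torchtune code)

import os

import torch
import torch.distributed as dist


def main():
    # initialize the NCCL process group and bind this rank to its GPU
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # rank 0 holds the data; every other rank receives the same 16-byte tensor
    t = torch.arange(16, dtype=torch.uint8, device=f"cuda:{local_rank}")
    dist.broadcast(t, src=0)
    torch.cuda.synchronize()
    print(f"rank {dist.get_rank()}: broadcast ok")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()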

Here are the detailed outputs:

[I1022 17:07:37.797161243 debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I1022 17:07:39.836860510 debug.cpp:49] [c10d] The debug level is set to DETAIL.
Running with torchrun...
Namespace(func=<bound method Run._run_cmd of <torchtune._cli.run.Run object at 0x7fe27ff10940>>, nnodes='1:1', nproc_per_node='4', rdzv_backend='static', rdzv_endpoint='', rdzv_id='none', rdzv_conf='', standalone=False, max_restarts=0, monitor_interval=0.1, start_method='spawn', role='default', module=False, no_python=False, run_path=False, log_dir=None, redirects='0', tee='0', local_ranks_filter='', node_rank=0, master_addr='127.0.0.1', master_port=29500, local_addr=None, logs_specs=None, recipe='/data/user/shared/torchtune/recipes/lora_finetune_distributed.py', recipe_args=['--config', '/home/user/shared/torchtune/recipes/configs/llama3_1/8B_lora.yaml'], training_script='/data/user/shared/torchtune/recipes/lora_finetune_distributed.py', training_script_args=['--config', '/home/user/shared/torchtune/recipes/configs/llama3_1/8B_lora.yaml'])
W1022 17:07:40.830000 2790913 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py:793] 
W1022 17:07:40.830000 2790913 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W1022 17:07:40.830000 2790913 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1022 17:07:40.830000 2790913 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
[I1022 17:07:40.696906459 TCPStore.cpp:298] [c10d - debug] The server has started on port = 29500.
[I1022 17:07:40.696954905 TCPStoreLibUvBackend.cpp:1100] [c10d - debug] Uv main loop running
[I1022 17:07:40.697016370 socket.cpp:773] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500).
[I1022 17:07:40.697062480 socket.cpp:847] [c10d - trace] The client socket is attempting to connect to [::ffff:127.0.0.1]:29500.
[I1022 17:07:40.697840892 socket.cpp:938] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on SocketImpl(fd=72, addr=[::ffff:127.0.0.1]:45702, remote=[::ffff:127.0.0.1]:29500).
[I1022 17:07:40.698006414 TCPStore.cpp:334] [c10d - debug] TCP client connected to host 127.0.0.1:29500
[I1022 17:07:41.356534965 debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I1022 17:07:41.356535002 debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I1022 17:07:41.358519543 debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I1022 17:07:41.360214939 debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I1022 17:07:43.380146172 debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I1022 17:07:43.413592241 debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I1022 17:07:43.425319630 debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I1022 17:07:43.428458372 debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I1022 17:07:44.217347549 socket.cpp:773] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500).
[I1022 17:07:44.217418420 socket.cpp:847] [c10d - trace] The client socket is attempting to connect to [::ffff:127.0.0.1]:29500.
[I1022 17:07:44.218081149 socket.cpp:938] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on SocketImpl(fd=50, addr=[::ffff:127.0.0.1]:35816, remote=[::ffff:127.0.0.1]:29500).
[I1022 17:07:44.218207459 TCPStore.cpp:334] [c10d - debug] TCP client connected to host 127.0.0.1:29500
[I1022 17:07:44.218792363 ProcessGroupNCCL.cpp:905] [PG ID 0 PG GUID 0 Rank 0] ProcessGroupNCCL initialization options: size: 4, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
[I1022 17:07:44.218819744 ProcessGroupNCCL.cpp:914] [PG ID 0 PG GUID 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.21.5, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 480, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0, TORCH_NCCL_CUDA_EVENT_CACHE: 0, TORCH_NCCL_LOG_CPP_STACK_ON_UNCLEAN_SHUTDOWN: 1
[I1022 17:07:44.320745613 socket.cpp:773] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500).
[I1022 17:07:44.320819673 socket.cpp:773] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500).
[I1022 17:07:44.320876774 socket.cpp:847] [c10d - trace] The client socket is attempting to connect to [::ffff:127.0.0.1]:29500.
[I1022 17:07:44.320818546 socket.cpp:847] [c10d - trace] The client socket is attempting to connect to [::ffff:127.0.0.1]:29500.
[I1022 17:07:44.321534202 socket.cpp:938] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on SocketImpl(fd=50, addr=[::ffff:127.0.0.1]:35818, remote=[::ffff:127.0.0.1]:29500).
[I1022 17:07:44.321666135 TCPStore.cpp:334] [c10d - debug] TCP client connected to host 127.0.0.1:29500
[I1022 17:07:44.321551946 socket.cpp:938] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on SocketImpl(fd=50, addr=[::ffff:127.0.0.1]:35820, remote=[::ffff:127.0.0.1]:29500).
[I1022 17:07:44.321683112 TCPStore.cpp:334] [c10d - debug] TCP client connected to host 127.0.0.1:29500
[I1022 17:07:44.322037997 ProcessGroupNCCL.cpp:905] [PG ID 0 PG GUID 0 Rank 2] ProcessGroupNCCL initialization options: size: 4, global rank: 2, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
[I1022 17:07:44.322040044 ProcessGroupNCCL.cpp:905] [PG ID 0 PG GUID 0 Rank 3] ProcessGroupNCCL initialization options: size: 4, global rank: 3, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
[I1022 17:07:44.322058372 ProcessGroupNCCL.cpp:914] [PG ID 0 PG GUID 0 Rank 2] ProcessGroupNCCL environments: NCCL version: 2.21.5, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 480, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0, TORCH_NCCL_CUDA_EVENT_CACHE: 0, TORCH_NCCL_LOG_CPP_STACK_ON_UNCLEAN_SHUTDOWN: 1
[I1022 17:07:44.322060888 ProcessGroupNCCL.cpp:914] [PG ID 0 PG GUID 0 Rank 3] ProcessGroupNCCL environments: NCCL version: 2.21.5, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 480, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0, TORCH_NCCL_CUDA_EVENT_CACHE: 0, TORCH_NCCL_LOG_CPP_STACK_ON_UNCLEAN_SHUTDOWN: 1
[I1022 17:07:44.334103873 socket.cpp:773] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500).
[I1022 17:07:44.334160955 socket.cpp:847] [c10d - trace] The client socket is attempting to connect to [::ffff:127.0.0.1]:29500.
[I1022 17:07:44.334705101 socket.cpp:938] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on SocketImpl(fd=50, addr=[::ffff:127.0.0.1]:35822, remote=[::ffff:127.0.0.1]:29500).
[I1022 17:07:44.334847697 TCPStore.cpp:334] [c10d - debug] TCP client connected to host 127.0.0.1:29500
[I1022 17:07:44.335158871 ProcessGroupNCCL.cpp:905] [PG ID 0 PG GUID 0 Rank 1] ProcessGroupNCCL initialization options: size: 4, global rank: 1, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
[I1022 17:07:44.335178299 ProcessGroupNCCL.cpp:914] [PG ID 0 PG GUID 0 Rank 1] ProcessGroupNCCL environments: NCCL version: 2.21.5, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 480, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0, TORCH_NCCL_CUDA_EVENT_CACHE: 0, TORCH_NCCL_LOG_CPP_STACK_ON_UNCLEAN_SHUTDOWN: 1
INFO:torchtune.utils._logging:Running LoRAFinetuneRecipeDistributed with resolved config:

batch_size: 2
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /home/user/projects/llm-train/model/llama3-8b
  checkpoint_files:
  - model-00001-of-00004.safetensors
  - model-00002-of-00004.safetensors
  - model-00003-of-00004.safetensors
  - model-00004-of-00004.safetensors
  model_type: LLAMA3
  output_dir: outputs//llama3-8b-test
  recipe_checkpoint: null
compile: false
dataset:
  _component_: torchtune.datasets.SFTDataset
  source:
    name: /home/user/projects/llm-train/data_preparation/new/processed_data/1010-rm-act-v8/postfiltered_sep.json
device: cuda
dtype: bf16
enable_activation_checkpointing: false
epochs: 2
gradient_accumulation_steps: 4
log_every_n_steps: 1
log_peak_memory_stats: false
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 10
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: outputs//llama3-8b-test
model:
  _component_: torchtune.models.llama3_1.lora_llama3_1_8b
  apply_lora_to_mlp: true
  apply_lora_to_output: false
  lora_alpha: 16
  lora_attn_modules:
  - q_proj
  - v_proj
  lora_dropout: 0.0
  lora_rank: 8
optimizer:
  _component_: torch.optim.AdamW
  fused: true
  lr: 0.0003
  weight_decay: 0.01
output_dir: outputs//llama3-8b-test
resume_from_checkpoint: false
seed: 42
shuffle: true
tokenizer:
  _component_: torchtune.modules.tokenizers.HFTokenizer
  max_seq_len: null
  path: /home/user/projects/llm-train/model/llama3-8b

DEBUG:torchtune.utils._logging:Setting manual seed to local seed 42. Local seed is seed + rank = 42 + 0
Writing logs to outputs/llama3-8b-test/log_1729588066.txt
INFO:torchtune.utils._logging:FSDP is enabled. Instantiating model and loading checkpoint on Rank 0 ...
[rank2]:[I1022 17:07:47.551950116 ProcessGroupWrapper.cpp:587] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=BROADCAST, TensorShape=[16], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=c10::BFloat16 (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank3]:[I1022 17:07:47.558312214 ProcessGroupWrapper.cpp:587] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=BROADCAST, TensorShape=[16], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=c10::BFloat16 (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank1]:[I1022 17:07:47.559710289 ProcessGroupWrapper.cpp:587] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=BROADCAST, TensorShape=[16], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=c10::BFloat16 (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank0]:[I1022 17:07:47.658896824 ProcessGroupWrapper.cpp:587] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=BROADCAST, TensorShape=[16], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=c10::BFloat16 (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank0]:[I1022 17:07:47.661509864 ProcessGroupNCCL.cpp:2262] [PG ID 0 PG GUID 0 Rank 0] ProcessGroupNCCL broadcast unique ID through store took 0.033109 ms
[rank2]:[I1022 17:07:47.661602191 ProcessGroupNCCL.cpp:2262] [PG ID 0 PG GUID 0 Rank 2] ProcessGroupNCCL broadcast unique ID through store took 1.32689 ms
[rank3]:[I1022 17:07:47.661607930 ProcessGroupNCCL.cpp:2262] [PG ID 0 PG GUID 0 Rank 3] ProcessGroupNCCL broadcast unique ID through store took 1.32036 ms
[rank1]:[I1022 17:07:47.661625472 ProcessGroupNCCL.cpp:2262] [PG ID 0 PG GUID 0 Rank 1] ProcessGroupNCCL broadcast unique ID through store took 1.3291 ms
NCCL version 2.21.5+cuda12.4
[I1022 17:07:48.778291492 TCPStoreLibUvBackend.cpp:119] [c10d - debug] Read callback failed. code:-4095 name:EOF desc:end of file
[I1022 17:07:48.804306340 TCPStoreLibUvBackend.cpp:119] [c10d - debug] Read callback failed. code:-4095 name:EOF desc:end of file
[I1022 17:07:48.829952539 TCPStoreLibUvBackend.cpp:119] [c10d - debug] Read callback failed. code:-4095 name:EOF desc:end of file
W1022 17:07:48.975000 2790913 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2791076 closing signal SIGTERM
W1022 17:07:48.979000 2790913 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2791078 closing signal SIGTERM
[I1022 17:07:48.857831227 TCPStoreLibUvBackend.cpp:119] [c10d - debug] Read callback failed. code:-4095 name:EOF desc:end of file
E1022 17:07:49.044000 2790913 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -11) local_rank: 0 (pid: 2791075) of binary: /home/user/miniconda3/envs/torchtune/bin/python
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/torchtune/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/data/user/shared/torchtune/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/data/user/shared/torchtune/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/data/user/shared/torchtune/torchtune/_cli/run.py", line 207, in _run_cmd
    self._run_distributed(args, is_builtin=is_builtin)
  File "/home/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/data/user/shared/torchtune/torchtune/_cli/run.py", line 96, in _run_distributed
    run(args)
  File "/home/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/data/user/shared/torchtune/recipes/lora_finetune_distributed.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-10-22_17:07:48
  host      : 10-7-133-248
  rank      : 2 (local_rank: 2)
  exitcode  : -11 (pid: 2791077)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 2791077
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-22_17:07:48
  host      : 10-7-133-248
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 2791075)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 2791075
============================================================
[I1022 17:07:49.742841786 TCPStoreLibUvBackend.cpp:119] [c10d - debug] Read callback failed. code:-4095 name:EOF desc:end of file
[I1022 17:07:49.742932318 TCPStoreLibUvBackend.cpp:1033] [c10d - debug] Store exit requested

[I1022 17:07:49.742943128 TCPStoreLibUvBackend.cpp:1103] [c10d - debug] UV main loop done: res:1
[I1022 17:07:49.742959887 TCPStoreLibUvBackend.cpp:1109] [c10d - debug] Walking live handles prior to closing clients
[I1022 17:07:49.742965980 TCPStoreLibUvBackend.cpp:1090] [c10d - debug] UV live handle type 12 active:1 is-closing:0
[I1022 17:07:49.742975905 TCPStoreLibUvBackend.cpp:1119] [c10d - debug] Walking live handles after closing clients
[I1022 17:07:49.742982589 TCPStoreLibUvBackend.cpp:1090] [c10d - debug] UV live handle type 12 active:0 is-closing:1
[I1022 17:07:49.742985997 TCPStoreLibUvBackend.cpp:1128] [c10d] uv_loop_close failed with:-16 errn:EBUSY desc:resource busy or locked
[I1022 17:07:49.743006792 TCPStoreLibUvBackend.cpp:1138] [c10d] uv_loop cleanup finished.

My CUDA version is 12.1, PyTorch is 2.5.0, and torchtune is built from the latest git repo.

The program runs successfully when using the lora_finetune_single_device recipe.

[I1022 17:07:49.742985997 TCPStoreLibUvBackend.cpp:1128] [c10d] uv_loop_close failed with:-16 errn:EBUSY desc:resource busy or locked

The log suggests there is an error during torch.distributed.init_process_group. Are you able to run a simple distributed script like the following without an init error?

# torchrun --standalone --nproc_per_node=2 run_fsdp2.py

import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard


def main():
    # initialize the NCCL process group and bind this rank to its GPU
    dist.init_process_group(backend="nccl")
    gpu_id = int(os.environ["LOCAL_RANK"])
    device = f"cuda:{gpu_id}"
    torch.cuda.set_device(device)
    torch.manual_seed(0)
    # tiny two-layer model, sharded per-layer and then as a whole with FSDP2
    model = nn.Sequential(
        *[nn.Linear(4, 4, device=device, bias=False) for _ in range(2)]
    )
    for layer in model:
        fully_shard(layer)
    fully_shard(model)
    optim = torch.optim.Adam(model.parameters(), lr=1e-2)
    # one forward/backward/optimizer step to exercise the collectives
    x = torch.randn((4, 4), device=device)
    model(x).sum().backward()
    optim.step()


if __name__ == "__main__":
    main()

I cannot run this script either. Here are the logs:

W1023 13:24:14.029000 2921205 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py:793] 
W1023 13:24:14.029000 2921205 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W1023 13:24:14.029000 2921205 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1023 13:24:14.029000 2921205 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
10-7-133-248:2921282:2921282 [0] NCCL INFO Bootstrap : Using eth0:10.7.133.248<0>
10-7-133-248:2921282:2921282 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
10-7-133-248:2921282:2921282 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
10-7-133-248:2921282:2921282 [0] NCCL INFO NET/Plugin: Using internal network plugin.
10-7-133-248:2921282:2921282 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.21.5+cuda12.4
10-7-133-248:2921283:2921283 [1] NCCL INFO cudaDriverVersion 12020
10-7-133-248:2921283:2921283 [1] NCCL INFO Bootstrap : Using eth0:10.7.133.248<0>
10-7-133-248:2921283:2921283 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
10-7-133-248:2921283:2921283 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
10-7-133-248:2921283:2921283 [1] NCCL INFO NET/Plugin: Using internal network plugin.
10-7-133-248:2921282:2921302 [0] NCCL INFO NET/IB : Using [0]rdma_3:1/RoCE [1]rdma_2:1/RoCE [2]rdma_1:1/RoCE [3]rdma_0:1/RoCE [4]mlx5_4:1/RoCE [RO]; OOB eth0:10.7.133.248<0>
10-7-133-248:2921282:2921302 [0] NCCL INFO Using non-device net plugin version 0
10-7-133-248:2921282:2921302 [0] NCCL INFO Using network IB
10-7-133-248:2921283:2921303 [1] NCCL INFO NET/IB : Using [0]rdma_3:1/RoCE [1]rdma_2:1/RoCE [2]rdma_1:1/RoCE [3]rdma_0:1/RoCE [4]mlx5_4:1/RoCE [RO]; OOB eth0:10.7.133.248<0>
10-7-133-248:2921283:2921303 [1] NCCL INFO Using non-device net plugin version 0
10-7-133-248:2921283:2921303 [1] NCCL INFO Using network IB
10-7-133-248:2921283:2921303 [1] NCCL INFO ncclCommInitRank comm 0xb42e600 rank 1 nranks 2 cudaDev 1 nvmlDev 5 busId 95000 commId 0x4a30f58ffecec16a - Init START
10-7-133-248:2921282:2921302 [0] NCCL INFO ncclCommInitRank comm 0xafe1a50 rank 0 nranks 2 cudaDev 0 nvmlDev 4 busId 94000 commId 0x4a30f58ffecec16a - Init START
W1023 13:24:17.507000 2921205 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2921283 closing signal SIGTERM
E1023 13:24:17.539000 2921205 /data/user/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -11) local_rank: 0 (pid: 2921282) of binary: /home/user/miniconda3/envs/torchtune/bin/python

I can run this script successfully in another Python environment on the same machine. Here are the logs:

W1023 13:24:46.562000 140095854105216 torch/distributed/run.py:779] 
W1023 13:24:46.562000 140095854105216 torch/distributed/run.py:779] *****************************************
W1023 13:24:46.562000 140095854105216 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1023 13:24:46.562000 140095854105216 torch/distributed/run.py:779] *****************************************
10-7-133-248:2921365:2921365 [0] NCCL INFO Bootstrap : Using eth0:10.7.133.248<0>
10-7-133-248:2921365:2921365 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
10-7-133-248:2921365:2921365 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.20.5+cuda12.4
10-7-133-248:2921366:2921366 [1] NCCL INFO cudaDriverVersion 12020
10-7-133-248:2921366:2921366 [1] NCCL INFO Bootstrap : Using eth0:10.7.133.248<0>
10-7-133-248:2921366:2921366 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
10-7-133-248:2921365:2921405 [0] NCCL INFO NET/IB : Using [0]rdma_3:1/RoCE [1]rdma_2:1/RoCE [2]rdma_1:1/RoCE [3]rdma_0:1/RoCE [4]mlx5_4:1/RoCE [RO]; OOB eth0:10.7.133.248<0>
10-7-133-248:2921366:2921406 [1] NCCL INFO NET/IB : Using [0]rdma_3:1/RoCE [1]rdma_2:1/RoCE [2]rdma_1:1/RoCE [3]rdma_0:1/RoCE [4]mlx5_4:1/RoCE [RO]; OOB eth0:10.7.133.248<0>
10-7-133-248:2921365:2921405 [0] NCCL INFO Using non-device net plugin version 0
10-7-133-248:2921365:2921405 [0] NCCL INFO Using network IB
10-7-133-248:2921366:2921406 [1] NCCL INFO Using non-device net plugin version 0
10-7-133-248:2921366:2921406 [1] NCCL INFO Using network IB
10-7-133-248:2921366:2921406 [1] NCCL INFO comm 0xa0cd0f0 rank 1 nranks 2 cudaDev 1 nvmlDev 5 busId 95000 commId 0x1b99358c45e6af2 - Init START
10-7-133-248:2921365:2921405 [0] NCCL INFO comm 0xa671100 rank 0 nranks 2 cudaDev 0 nvmlDev 4 busId 94000 commId 0x1b99358c45e6af2 - Init START
10-7-133-248:2921365:2921405 [0] NCCL INFO Setting affinity for GPU 4 to ffffffff,ffffffff,00000000,00000000
10-7-133-248:2921366:2921406 [1] NCCL INFO Setting affinity for GPU 5 to ffffffff,ffffffff,00000000,00000000
10-7-133-248:2921366:2921406 [1] NCCL INFO comm 0xa0cd0f0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
10-7-133-248:2921365:2921405 [0] NCCL INFO comm 0xa671100 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 00/16 :    0   1
10-7-133-248:2921366:2921406 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] 0/-1/-1->1->-1 [5] 0/-1/-1->1->-1 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] -1/-1/-1->1->0 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] 0/-1/-1->1->-1 [13] 0/-1/-1->1->-1 [14] 0/-1/-1->1->-1 [15] 0/-1/-1->1->-1
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 01/16 :    0   1
10-7-133-248:2921366:2921406 [1] NCCL INFO P2P Chunksize set to 524288
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 02/16 :    0   1
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 03/16 :    0   1
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 04/16 :    0   1
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 05/16 :    0   1
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 06/16 :    0   1
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 07/16 :    0   1
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 08/16 :    0   1
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 09/16 :    0   1
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 10/16 :    0   1
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 11/16 :    0   1
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 12/16 :    0   1
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 13/16 :    0   1
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 14/16 :    0   1
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 15/16 :    0   1
10-7-133-248:2921365:2921405 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] -1/-1/-1->0->1 [5] -1/-1/-1->0->1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] -1/-1/-1->0->1 [13] -1/-1/-1->0->1 [14] -1/-1/-1->0->1 [15] -1/-1/-1->0->1
10-7-133-248:2921365:2921405 [0] NCCL INFO P2P Chunksize set to 524288
10-7-133-248:2921366:2921406 [1] NCCL INFO Channel 00/0 : 1[5] -> 0[4] via P2P/CUMEM/read
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 00/0 : 0[4] -> 1[5] via P2P/CUMEM/read
10-7-133-248:2921366:2921406 [1] NCCL INFO Channel 01/0 : 1[5] -> 0[4] via P2P/CUMEM/read
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 01/0 : 0[4] -> 1[5] via P2P/CUMEM/read
10-7-133-248:2921366:2921406 [1] NCCL INFO Channel 02/0 : 1[5] -> 0[4] via P2P/CUMEM/read
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 02/0 : 0[4] -> 1[5] via P2P/CUMEM/read
10-7-133-248:2921366:2921406 [1] NCCL INFO Channel 03/0 : 1[5] -> 0[4] via P2P/CUMEM/read
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 03/0 : 0[4] -> 1[5] via P2P/CUMEM/read
10-7-133-248:2921366:2921406 [1] NCCL INFO Channel 04/0 : 1[5] -> 0[4] via P2P/CUMEM/read
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 04/0 : 0[4] -> 1[5] via P2P/CUMEM/read
10-7-133-248:2921366:2921406 [1] NCCL INFO Channel 05/0 : 1[5] -> 0[4] via P2P/CUMEM/read
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 05/0 : 0[4] -> 1[5] via P2P/CUMEM/read
10-7-133-248:2921366:2921406 [1] NCCL INFO Channel 06/0 : 1[5] -> 0[4] via P2P/CUMEM/read
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 06/0 : 0[4] -> 1[5] via P2P/CUMEM/read
10-7-133-248:2921366:2921406 [1] NCCL INFO Channel 07/0 : 1[5] -> 0[4] via P2P/CUMEM/read
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 07/0 : 0[4] -> 1[5] via P2P/CUMEM/read
10-7-133-248:2921366:2921406 [1] NCCL INFO Channel 08/0 : 1[5] -> 0[4] via P2P/CUMEM/read
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 08/0 : 0[4] -> 1[5] via P2P/CUMEM/read
10-7-133-248:2921366:2921406 [1] NCCL INFO Channel 09/0 : 1[5] -> 0[4] via P2P/CUMEM/read
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 09/0 : 0[4] -> 1[5] via P2P/CUMEM/read
10-7-133-248:2921366:2921406 [1] NCCL INFO Channel 10/0 : 1[5] -> 0[4] via P2P/CUMEM/read
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 10/0 : 0[4] -> 1[5] via P2P/CUMEM/read
10-7-133-248:2921366:2921406 [1] NCCL INFO Channel 11/0 : 1[5] -> 0[4] via P2P/CUMEM/read
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 11/0 : 0[4] -> 1[5] via P2P/CUMEM/read
10-7-133-248:2921366:2921406 [1] NCCL INFO Channel 12/0 : 1[5] -> 0[4] via P2P/CUMEM/read
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 12/0 : 0[4] -> 1[5] via P2P/CUMEM/read
10-7-133-248:2921366:2921406 [1] NCCL INFO Channel 13/0 : 1[5] -> 0[4] via P2P/CUMEM/read
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 13/0 : 0[4] -> 1[5] via P2P/CUMEM/read
10-7-133-248:2921366:2921406 [1] NCCL INFO Channel 14/0 : 1[5] -> 0[4] via P2P/CUMEM/read
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 14/0 : 0[4] -> 1[5] via P2P/CUMEM/read
10-7-133-248:2921366:2921406 [1] NCCL INFO Channel 15/0 : 1[5] -> 0[4] via P2P/CUMEM/read
10-7-133-248:2921365:2921405 [0] NCCL INFO Channel 15/0 : 0[4] -> 1[5] via P2P/CUMEM/read
10-7-133-248:2921366:2921406 [1] NCCL INFO Connected all rings
10-7-133-248:2921366:2921406 [1] NCCL INFO Connected all trees
10-7-133-248:2921365:2921405 [0] NCCL INFO Connected all rings
10-7-133-248:2921366:2921406 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
10-7-133-248:2921366:2921406 [1] NCCL INFO 16 coll channels, 0 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
10-7-133-248:2921365:2921405 [0] NCCL INFO Connected all trees
10-7-133-248:2921365:2921405 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
10-7-133-248:2921365:2921405 [0] NCCL INFO 16 coll channels, 0 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
10-7-133-248:2921366:2921406 [1] NCCL INFO comm 0xa0cd0f0 rank 1 nranks 2 cudaDev 1 nvmlDev 5 busId 95000 commId 0x1b99358c45e6af2 - Init COMPLETE
10-7-133-248:2921365:2921405 [0] NCCL INFO comm 0xa671100 rank 0 nranks 2 cudaDev 0 nvmlDev 4 busId 94000 commId 0x1b99358c45e6af2 - Init COMPLETE
[rank0]:[W1023 13:24:49.316696791 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

I see. For the working Python env, does it run the torchtune recipe well?
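One more observation from the logs above: the failing env reports NCCL 2.21.5+cuda12.4 while the working env reports NCCL 2.20.5+cuda12.4. It may be worth printing the exact versions each environment resolves to, for example with a quick check like this (just a sketch):

# quick version check to run in both environments (sketch only)
import torch

print("torch version      :", torch.__version__)
print("built against CUDA :", torch.version.cuda)
print("NCCL version       :", torch.cuda.nccl.version())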