DDP hangs when initializing

Hi everyone, I am following this tutorial (Huggingface Knowledge Distillation) and my process hangs when initializing the DDP model at this line.
I added the following environment variables to get more detailed logs: NCCL_ASYNC_ERROR_HANDLING=1 NCCL_DEBUG=DEBUG TORCH_DISTRIBUTED_DEBUG=DETAIL
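For reference, the same variables can also be set from Python at the very top of main.py, before the process group is created; a minimal sketch (note that NCCL's documented verbosity levels are VERSION, WARN, INFO, and TRACE):

import os

# must run before anything initializes the process group
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
os.environ["NCCL_DEBUG"] = "INFO"                 # verbose NCCL logging
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra collective/shape checks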
Here is the full log:

Traceback (most recent call last):
  File "main.py", line 137, in <module>
    main()
  File "main.py", line 130, in main
    distiller = Distiller(
  File "/workspace/ProtonX-MLE/distiller.py", line 176, in __init__
    self.student = DistributedDataParallel(
  File "/root/miniconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/root/miniconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: ProcessGroupWrapper: Monitored Barrier encountered error running collective: CollectiveFingerPrint(OpType=ALLGATHER, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))). Error: 
[../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.42.183.23]:11911
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1852) of binary: /root/miniconda3/envs/py38/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/py38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/root/miniconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/root/miniconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/root/miniconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-31_03:46:54
  host      : kb-dialogue-train-4-5cd5656dbd-qw7fl
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1852)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Can anyone help me out?

Note that I can run DDP successfully with the Hugging Face Trainer.

From the error, it looks like the student model's parameters differ across ranks when you load it?
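One way to confirm is to print a per-rank fingerprint of the student's parameters right before wrapping it in DistributedDataParallel (a rough sketch; "student" stands for whatever model your Distiller wraps):

import torch.distributed as dist

def report_param_fingerprint(model, tag="student"):
    # print parameter count and the first few shapes on every rank so a
    # mismatch between ranks shows up directly in the logs
    shapes = [tuple(p.shape) for p in model.parameters()]
    numel = sum(p.numel() for p in model.parameters())
    print(f"[rank {dist.get_rank()}] {tag}: {len(shapes)} tensors, "
          f"{numel} elements, first shapes: {shapes[:3]}")

# call on every rank just before DistributedDataParallel(self.student, ...)
# report_param_fingerprint(self.student)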

Thanks for your response! It was my mistake: when I modified the code, I indented part of it incorrectly, so the ranks ended up building different models.
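For anyone hitting the same error, here is a hypothetical illustration of this kind of indentation mistake: model surgery that accidentally ends up inside a rank-0-only branch, so the ranks no longer have matching parameter shapes when DDP verifies them (the model name and layer size are made up, and the process group is assumed to be initialized already by the launcher):

import torch
import torch.distributed as dist
from transformers import AutoModelForSequenceClassification

student = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

if dist.get_rank() == 0:
    print("preparing student model")
    # BUG: one indent too deep, so only rank 0 replaces the classification head
    student.classifier = torch.nn.Linear(student.config.dim, 10)

# fix: dedent the modification so every rank builds an identical model before DDP
# student.classifier = torch.nn.Linear(student.config.dim, 10)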