Can't understand NCCL error

I am fine-tuning Llama-2-7B with FSDP on 2 A100 80GB GPUs. My PyTorch version is 2.0.1+cu118. The model runs for 15 batches/steps and then crashes with the following stack trace:

[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=388923, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804615 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=388923, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804360 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=388923, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804360 milliseconds before timing out.
Fatal Python error: Aborted

Thread 0x00002ac6175f7700 (most recent call first):
  File "/cluster/project/sachan/kushal/math/lib64/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2532 in all_gather_into_tensor
  File "/cluster/project/sachan/kushal/math/lib64/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1451 in wrapper
  File "/cluster/project/sachan/kushal/math/lib64/python3.8/site-packages/torch/distributed/fsdp/flat_param.py", line 987 in _all_gather_flat_param
  File "/cluster/project/sachan/kushal/math/lib64/python3.8/site-packages/torch/distributed/fsdp/flat_param.py", line 919 in unshard
  File "/cluster/project/sachan/kushal/math/lib64/python3.8/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 329 in _unshard
  File "/cluster/project/sachan/kushal/math/lib64/python3.8/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 1007 in _prefetch_handles
  File "/cluster/project/sachan/kushal/math/lib64/python3.8/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 643 in _pre_backward_hook

Thread 0x00002ac43d076700 (most recent call first):
<no Python frame>

It seems like there is a race condition or synchronization issue happening here. How should I debug this further?
Another concern I have is that FSDP seems to be running the forward pass in parallel on the 2 GPUs (see the sketch after this paragraph for the pattern I am using). My understanding was that FSDP shards the model across the GPUs and runs a single forward/backward pass across both processes. Is my understanding incorrect?
PS: compute-sanitizer and CUDA_LAUNCH_BLOCKING do not help in tracking down this error.
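
To make the second question concrete, my training loop follows roughly the standard FSDP pattern below. This is a simplified, self-contained sketch (tiny placeholder model and random data instead of Llama-2-7B, launched with torchrun --nproc_per_node=2), not my actual script:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Tiny placeholder model instead of llama-2-7b; FSDP shards its parameters across ranks.
model = FSDP(torch.nn.Linear(512, 512).cuda())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # created after FSDP wrapping

# Each rank draws a different shard of the data, i.e. the usual data-parallel pattern.
dataset = TensorDataset(torch.randn(64, 512), torch.randn(64, 512))
loader = DataLoader(dataset, batch_size=4, sampler=DistributedSampler(dataset))

for x, y in loader:
    loss = torch.nn.functional.mse_loss(model(x.cuda()), y.cuda())
    loss.backward()   # FSDP all-gathers / reduce-scatters happen inside forward and backward
    optimizer.step()
    optimizer.zero_grad()
```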

NCCL might just be the victim here, re-raising errors from your other post.
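
If that is the case, one way to confirm it is to make the failing rank surface its own error instead of letting the healthy rank hang until the 30-minute watchdog fires. A rough sketch of what I would try first (env var names are the ones PyTorch 2.0.x reads; the 5-minute timeout is just an example value):

```python
import os
from datetime import timedelta
import torch.distributed as dist

# Tear the NCCL process group down as soon as any rank hits an error,
# instead of letting the other rank block until the watchdog timeout.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"

# Verbose NCCL and c10d logging: shows which collective each rank is stuck in.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

# Shorter timeout so a desync shows up in minutes rather than half an hour.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=5))
```

If one rank then dies with its own exception (e.g. an OOM or a collective mismatch) before the timeout, that exception is the real cause and the all-gather timeout is just a symptom.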