How to broadcast tensors using NCCL?

@vgoklani @Li_Shen

I answered this in another post, *Deadlock when using torch.distributed.broadcast*.

Basically, when you add collectives like broadcast, please make sure they are called on all ranks rather than only on rank 0 (just like all_reduce in your script); that should resolve the issue. Let me know if it doesn't work.
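
For reference, here's a minimal sketch (not your script) of broadcasting a tensor over NCCL with every rank participating. It assumes the script is launched with torchrun, so `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` are set in the environment; the tensor shape and values are just placeholders:

```python
# Minimal sketch: every rank must call broadcast; only the source rank's
# tensor contents are sent, the other ranks receive into their buffer.
import os
import torch
import torch.distributed as dist

def main():
    # Assumes launch via torchrun, which sets RANK/WORLD_SIZE/LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Rank 0 holds the real data; other ranks allocate a same-shaped buffer.
    if rank == 0:
        tensor = torch.arange(4, dtype=torch.float32, device="cuda")
    else:
        tensor = torch.empty(4, dtype=torch.float32, device="cuda")

    # Every rank calls broadcast -- calling it only on rank 0 deadlocks,
    # because the other ranks never join the collective.
    dist.broadcast(tensor, src=0)
    print(f"rank {rank}: {tensor.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

You can run it with something like `torchrun --nproc_per_node=2 broadcast_example.py` (the filename is just an example).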