VXLAN impact on NCCL broadcasting

Hi, we launch a PyTorchJob on our Kubernetes cluster (v1.20) to fine-tune a BERT model with the NCCL backend, using transformers/run_language_modeling.py at v2.8.0 · huggingface/transformers · GitHub
We use NCCL version 2.10.3+cuda10.2.

We explicitly schedule the master pod and the worker pod on two different Kubernetes nodes with V100 GPUs. After we enabled VXLAN, we found that data transfer time was 2-3 times slower. We then enabled jumbo frames and increased the MTU to 9000, but that did not yield any improvement.
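For anyone debugging a similar setup, a minimal sketch of the NCCL environment variables one could set inside each pod to see which network interface and transport NCCL actually selects (the interface name `eth0` is an assumption; check the pod's interfaces with `ip link` first):

```shell
# Sketch only: turn on NCCL's init/network debug logging so the startup
# log shows which interface and transport (socket vs. InfiniBand) is used.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Pin NCCL's socket transport to the pod's overlay interface.
# "eth0" is an assumption; substitute the interface your CNI creates.
export NCCL_SOCKET_IFNAME=eth0
```

With these set, the NCCL init log printed at job start should confirm whether traffic is going over the VXLAN interface, which helps separate overlay overhead from interface-selection problems.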

Has anyone experienced the same issue with VXLAN?

I would recommend posting this issue in the NCCL repository, as it doesn't seem to be PyTorch-specific and the NCCL devs might be able to help.