Hi, we launch a PyTorchJob on our Kubernetes cluster (1.20) to fine-tune a BERT model with the NCCL backend, using transformers/run_language_modeling.py at v2.8.0 (huggingface/transformers · GitHub).
We use NCCL version 2.10.3+cuda10.2
We explicitly schedule the master pod and the worker pod on two different Kubernetes nodes with V100 GPUs. After we enabled VXLAN, we found that data transfer is 2-3 times slower. We then enabled jumbo frames and increased the MTU to 9000, but that didn't bring any improvement.
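For context, here is a hedged sketch of the diagnostics we can run on the pods. The NCCL environment variables are standard NCCL settings (not from our job spec), and the interface name `eth0` is an assumption about the pod's network interface:

```shell
# Sketch, assuming the pod's overlay interface is eth0.
export NCCL_DEBUG=INFO           # log which interface and transport NCCL selects
export NCCL_SOCKET_IFNAME=eth0   # assumption: force NCCL onto the pod interface

# A raw TCP benchmark between the two pods can show whether the VXLAN
# overlay itself (encapsulation overhead, effective MTU) is the bottleneck,
# independent of NCCL:
#   iperf3 -s                    # on the master pod
#   iperf3 -c <master-pod-ip>    # on the worker pod
```

With `NCCL_DEBUG=INFO`, the NCCL startup logs show the chosen interface and transport, which helps confirm traffic is actually going over the VXLAN device.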
Has anyone experienced the same issue with VXLAN?