InfiniBand vs TCP

Hello,

I am comparing the same workflow across InfiniBand (IB) and TCP. The workflow is one of the first examples from the torchtune repo.

IB compute time: 37.8 min (NCCL_IB_DISABLE=0)
TCP compute time: 32 min (NCCL_IB_DISABLE=1)
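
To confirm which transport NCCL actually selected in each run, one option is to enable NCCL's debug logging alongside the IB switch. Below is a minimal sketch, assuming the job is launched from a Python wrapper. `nccl_env` is a hypothetical helper name (not part of NCCL or torchtune); `NCCL_IB_DISABLE`, `NCCL_DEBUG`, and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables.

```python
import os


def nccl_env(use_ib, base=None):
    """Return a copy of the environment with NCCL transport and debug settings.

    Hypothetical helper for illustration; not part of NCCL or torchtune.
    """
    env = dict(os.environ if base is None else base)
    # NCCL_IB_DISABLE=1 tells NCCL to skip the InfiniBand transport
    # and fall back to IP sockets (TCP).
    env["NCCL_IB_DISABLE"] = "0" if use_ib else "1"
    # INFO-level logging reports which network transport was chosen at init.
    env["NCCL_DEBUG"] = "INFO"
    # Limit the log output to initialization and network messages.
    env["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
    return env
```

The returned dict can be passed as `env=` to `subprocess.run` together with whatever launch command the job uses. In the resulting logs, the network lines typically mention either `NET/IB` or `NET/Socket`, which shows whether IB was really in use.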

Should IB be faster than TCP?

If needed, I can add the YAML config.

Adding to my previous post:

I ran several experiments comparing the runtime of jobs that use IB with jobs that do not. The tests were run on the same hardware: 2 nodes with 8 H100 GPUs each.

I noticed that the GPUs only start being utilized after 621 s without IB (NO_IB) and after 1081 s with IB.
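
As a rough sanity check, the startup delay can be subtracted from the total runtime to see how long each job spends once the GPUs are busy. This is only a sketch: it assumes the startup delays here apply to the same runs as the 37.8 min and 32 min totals reported earlier, which may not be the case. `post_startup_seconds` is a hypothetical helper name.

```python
def post_startup_seconds(total_minutes, startup_seconds):
    """Time left after the GPUs become active, in seconds.

    Assumes total_minutes and startup_seconds were measured on the same run.
    """
    return total_minutes * 60 - startup_seconds


ib = post_startup_seconds(37.8, 1081)
tcp = post_startup_seconds(32, 621)
print(f"IB:  {ib:.0f} s after startup")   # IB:  1187 s after startup
print(f"TCP: {tcp:.0f} s after startup")  # TCP: 1299 s after startup
```

If that assumption holds, the IB run would actually be faster once training is underway, and the overall gap would come from the longer startup phase rather than from the communication during training.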

The test fine-tunes a 7B model using torchtune.