I ran several experiments comparing runtime between jobs that use IB (InfiniBand) and jobs that do not. The tests run on the same hardware: 2 nodes with 8 H100 GPUs each.
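I noticed that the GPUs only start being utilized after 621 seconds in the NO_IB run and after 1081 seconds in the IB run, so the IB run spends 460 seconds longer before any GPU work begins.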
The test fine-tunes a 7B model using torchtune.
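
To make the comparison reproducible, here is a minimal sketch of how the "time until the GPUs are first utilized" number could be derived from periodic utilization samples. The function name `first_busy_time`, the 5% threshold, and the sample list are all hypothetical and made up for illustration; only the 621 and 1081 second figures come from my runs.

```python
from typing import Iterable, Optional, Tuple


def first_busy_time(
    samples: Iterable[Tuple[float, float]], threshold: float = 5.0
) -> Optional[float]:
    """Return the elapsed time (seconds) of the first sample whose GPU
    utilization (percent) exceeds `threshold`, or None if none does.

    `samples` is an iterable of (elapsed_seconds, utilization_percent)
    pairs, e.g. collected by polling a GPU monitoring tool at a fixed
    interval. The threshold value is an assumption, not a standard.
    """
    for elapsed, util in samples:
        if util > threshold:
            return elapsed
    return None


# Made-up samples for illustration only (not measured data).
illustrative_samples = [(0.0, 0.0), (300.0, 0.0), (600.0, 1.0), (630.0, 87.0)]
illustrative_start = first_busy_time(illustrative_samples)

# Startup times reported above, in seconds.
no_ib_start = 621
ib_start = 1081

extra_delay = ib_start - no_ib_start   # additional seconds before GPUs are busy
slowdown = ib_start / no_ib_start      # relative startup slowdown of the IB run

print(f"illustrative first busy time: {illustrative_start} s")
print(f"IB run starts {extra_delay} s later ({slowdown:.2f}x the NO_IB startup)")
```

Applying the same threshold to both runs keeps the two startup figures comparable.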