No appreciable difference when using multiple GPUs

I was trying to use 3 GPUs for training, but the training time barely changes. When training on one GPU I set the batch size to 200; when training with 3 I set it to 1000.
GPU utilization is spread across the 3 GPUs, yet the time for one epoch stays about the same.

What are the possible causes?

You could time the execution of data loading, model execution, etc. to see what the real bottleneck is. Are you sure it's the model and not disk I/O?
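
For the model side, a rough sketch could look like the one below (model, criterion, optimizer, and loader are placeholders for your own objects; torch.cuda.synchronize() is needed because CUDA kernels run asynchronously, so the Python clock would otherwise stop before the GPU work is done):

import time
import torch

model_time = 0.0
for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    torch.cuda.synchronize()     # wait for any pending GPU work
    st = time.time()
    output = model(inputs)
    loss = criterion(output, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()     # make sure the step has finished on the GPU
    model_time += time.time() - st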

Any pointers on how to time the data loading?

If you are using a DataLoader in your training loop, you can measure the data loading time simply, like below.

import time

loader_time, st = 0.0, time.time()
for i, data in enumerate(loader):
    loader_time += time.time() - st   # time spent waiting on the DataLoader
    # ... your training step here ...
    st = time.time()                  # restart the clock before fetching the next batch
# end of the loop

Then you can compute what proportion of the whole training time is spent on data loading.
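
For example, if you also record the total epoch time (epoch_time below is an assumed variable, not part of the snippet above), the proportion is just the ratio:

epoch_start = time.time()
# ... run the training loop above ...
epoch_time = time.time() - epoch_start
print(f"data loading: {loader_time:.1f}s of {epoch_time:.1f}s "
      f"({100 * loader_time / epoch_time:.1f}% of the epoch)")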

Or, even better, profile the training loop directly, e.g. with PyTorch's autograd profiler.
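
A minimal sketch with the autograd profiler (again, model, criterion, optimizer, and loader are your own objects) could look like:

import torch

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for i, (inputs, targets) in enumerate(loader):
        output = model(inputs.cuda())
        loss = criterion(output, targets.cuda())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if i == 10:   # a handful of iterations is enough for a profile
            break
print(prof.key_averages().table(sort_by="cuda_time_total"))

The printed table breaks the time down per operator, which makes it easier to see whether the GPUs or the input pipeline are the bottleneck.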