I was trying to use 3 GPUs for training, but the training time changes only negligibly. When training on one GPU I set the batch size to 200; when training with 3 I set it to 1000.
The GPU utilization is distributed across the 3 GPUs, yet the training time for one epoch is still the same.
You could time the execution speed of dataloading, model execution, etc. to find the real bottleneck. Are you sure it's the model, and not disk I/O?
If you are using a DataLoader in your training loop, you can measure the dataloading time simply like below.
import time

loader_time, st = 0.0, time.time()
for i, data in enumerate(loader):
    loader_time += time.time() - st  # time spent waiting on the loader
    # sth for training
    # ...
    st = time.time()
# end of the loop
Then you can compare loader_time against the total epoch time to see what fraction of training is spent on dataloading.
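As a self-contained sketch of that measurement (the loader and step function here are toy stand-ins simulating I/O and compute with sleeps, not real training code):

```python
import time

def dataloading_fraction(loader, step_fn):
    """Return the fraction of one epoch spent waiting on the loader."""
    loader_time, compute_time = 0.0, 0.0
    st = time.time()
    for batch in loader:
        loader_time += time.time() - st   # time spent fetching the batch
        t0 = time.time()
        step_fn(batch)                    # forward/backward/optimizer step
        compute_time += time.time() - t0
        st = time.time()
    return loader_time / (loader_time + compute_time)

# Toy stand-ins: a "slow" loader (simulated disk I/O) and a fast step.
def slow_loader():
    for _ in range(5):
        time.sleep(0.01)
        yield None

frac = dataloading_fraction(slow_loader(), lambda b: time.sleep(0.001))
print(f"{frac:.0%} of the epoch was spent in dataloading")
```

If that fraction is large, the GPUs are starved and adding more of them won't help; increasing the DataLoader's num_workers is the usual first fix.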
Or, better, run a proper profiler to break the time down per operation.
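For example, with torch.profiler you can profile a step and print a per-operator breakdown (the tiny Linear model and random input here are just placeholders for your own model and batch):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(128, 10)   # placeholder model
data = torch.randn(32, 128)        # placeholder batch

# Profile one forward pass on CPU; add ProfilerActivity.CUDA for GPU runs.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    out = model(data)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

The resulting table shows where the time actually goes, so you can tell whether the model, the dataloading, or something else dominates the epoch.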