Hi, I am training two networks with different structures; one is much larger than the other (it has more parameters). Why does the larger one actually take much less time to train (time per iteration)? Is this possible, or would you suggest it might be a bug? Is there any way I can verify that my architectures are implemented correctly?
If you are using the GPU, did you synchronize the code before starting and stopping the timer?
This would be necessary, since CUDA operations are asynchronous and thus non-blocking.
If so, what does small/large refer to? Is it the depth of the model or also the number of parameters?
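To illustrate the synchronization point above, here is a minimal timing sketch (my own generic example, not code from this thread): the timer is only meaningful if the GPU is synchronized both before it starts and before it stops, since CUDA kernels are launched asynchronously.

```python
import time
import torch

def timed_forward(model, inputs):
    # Synchronize BEFORE starting the timer so previously queued
    # CUDA work does not leak into this measurement (no-op on CPU).
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()

    out = model(inputs)

    # Synchronize again BEFORE stopping the timer; otherwise we only
    # measure the asynchronous kernel launch, not the actual execution.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

# Hypothetical small model just to demonstrate the pattern.
model = torch.nn.Linear(512, 512)
x = torch.randn(8, 512)
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()

out, elapsed = timed_forward(model, x)
print(f"forward pass took {elapsed:.6f} s")
```

The same pattern applies around a full training iteration; without the second synchronize, the measured time would be attributed to whatever blocking call happens next.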
I will check the synchronization, thanks! I have torch.cuda.synchronize() at the beginning of the script. Is this what you are referring to?
Even though the ‘time per iteration’ might not be accurate here, it also takes a much longer time to train for the same number of epochs with the same batch size and the same dataset.
By the size of the model, I meant the number of parameters.
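For reference, a quick way to compare model sizes by trainable-parameter count (a generic PyTorch sketch; the models below are hypothetical placeholders, not the ones from this thread):

```python
import torch

def count_parameters(model):
    # Sum the element counts of all trainable tensors in the model.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

small = torch.nn.Linear(10, 10)    # 10*10 weights + 10 biases  = 110
large = torch.nn.Linear(100, 100)  # 100*100 weights + 100 biases = 10100

print(count_parameters(small))  # 110
print(count_parameters(large))  # 10100
```

Printing this for both networks is an easy sanity check that the "larger" model really does have more parameters than the "smaller" one.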
Yes, but it would be needed before starting and stopping all timers.
Otherwise your profiling will accumulate the times into the next blocking operation and yield wrong results.