I am hoping to make sense of something I have been observing for a while now and cannot explain. I am quite new to Deep Learning and PyTorch, so please bear with me.
I have been training a network on a workstation’s GPU. It’s a network that works with EEG signals (N channels, M time samples per channel), so the inputs are three-dimensional batches of shape (batch_size, N, M). I have around 100k of these samples.
I have been training different configurations of the network, with different parameters. Last week I ran a simple parameter study to analyze the impact of one parameter on the performance of the model. To do this, I have a script that iterates over the different values of that parameter. For each value, the network class is instantiated, the network is trained, and the final plots and results are produced. The dataloaders are initialized outside of this loop since building them is quite slow.
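For context, the loop looks roughly like the sketch below (a minimal, self-contained example: the dimensions, the toy model, and the parameter being swept are all made up and only stand in for my real network):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy EEG-like data with shape (num_samples, N channels, M time samples)
N_CHANNELS, N_TIMESTEPS = 32, 512
dataset = TensorDataset(torch.randn(1024, N_CHANNELS, N_TIMESTEPS),
                        torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True)  # built once, outside the loop

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for hidden in [16, 32, 64]:  # the parameter under study (illustrative values)
    # A new model is instantiated for every parameter value
    model = nn.Sequential(
        nn.Conv1d(N_CHANNELS, hidden, kernel_size=7, padding=3),
        nn.ReLU(),
        nn.AdaptiveAvgPool1d(1),
        nn.Flatten(),
        nn.Linear(hidden, 2),
    ).to(device)
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):  # the real runs train for many more epochs
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    # ...final plots and results are produced here...
```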
What I have observed is that the first iteration of this for loop takes ~6x more time than the following iterations.
The results seem to be consistent: even though the first network has fewer parameters than the following ones, it is by far the slowest to train. I have read that PyTorch might cache some operations, but that alone does not seem to explain such a huge speed-up to me.
Is this behavior possible? Or do I definitely have a bug that I have still not found?
In case it’s relevant: while training, the network occupies around 10% of the GPU’s VRAM.
Thank you very much for any help you can provide.
It depends on what operations are included in the first iteration of your loop. If the first iteration is doing things like instantiating the first CUDA tensor, calling the first CUDA kernels, etc., then the performance difference could be expected, as there are overheads associated with creating the CUDA context and loading kernels into GPU memory. Additionally, if you are running convolutions with the same shapes across iterations, the heuristics that determine, e.g., kernel selection are run once and the result is cached for later iterations.
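As a rough illustration (the layer and shapes below are arbitrary, not taken from your setup), you can see this one-time cost by timing the very first forward pass against later calls with the same input shape:

```python
import time
import torch
import torch.nn as nn

device = torch.device("cuda")

# The very first CUDA allocation also pays for creating the CUDA context.
x = torch.randn(64, 32, 512, device=device)
conv = nn.Conv1d(32, 64, kernel_size=7, padding=3).to(device)

def time_forward(n=1):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n):
        conv(x)
    torch.cuda.synchronize()  # wait for the GPU to actually finish before stopping the clock
    return (time.perf_counter() - start) / n

print("cold call (includes algorithm selection):", time_forward())
print("warm calls (selection already cached):   ", time_forward(20))
```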
It’s common practice to throw out the first/several warmup iterations when doing model benchmarking, as even factors such as GPU clocks ramping up from an idle power state will perturb the results.
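For example, a minimal timing helper along these lines discards the warmup calls and synchronizes before reading the clock (the model and input shapes are placeholders):

```python
import time
import torch

def benchmark(fn, warmup=10, iters=50):
    """Average runtime of fn, discarding the first `warmup` calls."""
    for _ in range(warmup):      # let clocks ramp up, kernels load, algorithm choices get cached
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()     # CUDA calls are asynchronous; sync before reading the clock
    return (time.perf_counter() - start) / iters

model = torch.nn.Conv1d(32, 64, kernel_size=7, padding=3).cuda()
batch = torch.randn(64, 32, 512, device="cuda")
print(f"{benchmark(lambda: model(batch)) * 1e3:.2f} ms per forward pass")
```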
Thank you for your quick response. However, I still do not quite understand the problem I am seeing.
Each iteration of my for loop consists of instantiating a new model and training it. I understand that the first epochs of the training process in each iteration might be slower than the rest due to initializing CUDA kernels, instantiating the CUDA context, and so on. However, it is not clear to me that this should result in such a considerable difference in total time (first iteration ~6h, others ~1h).
Furthermore, the same model architecture seems to reach different performance scores depending on which iteration of this for loop it is trained in. If it is trained in the first (slowest) iteration, the performance on the test set tends to be considerably better than if it is trained in the second or third iteration of the for loop.
Is this also consistent with the warm-up you mentioned? If so, is there any literature I can read describing these situations and how best to deal with them?