Hi team,
I’m encountering a reproducible phenomenon where the training speed during the first epoch is unstable in a peculiar way: training starts very slow for the first ~30% of minibatches and then accelerates significantly, becoming much faster after roughly the 50% mark. This behavior is consistent across runs. I’m using PyTorch Lightning to simplify training management.
What is the current behavior?
During the trainer.fit() call, the progress bar for the first training epoch shows a slow iteration speed (e.g., 2.5 it/s) for the initial few hundred batches. After this “warm-up” period, the speed dramatically increases (e.g., 25 it/s) and remains high for the rest of the epoch and subsequent validation stages.
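To quantify the per-batch speed more precisely than the progress bar readout, a small callback along the following lines can log host-side wall time per batch. This is a minimal sketch, not my actual code; the class name, logging interval, and print format are illustrative, while the hook signatures are the standard Lightning 2.x ones.

```python
import time

import lightning.pytorch as pl  # or `pytorch_lightning`, depending on the install


class BatchTimeLogger(pl.Callback):
    """Print host-side wall time per training batch (illustrative names/format)."""

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        self._t0 = time.perf_counter()

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        dt = time.perf_counter() - self._t0
        if batch_idx % 50 == 0:  # avoid flooding the log
            print(f"batch {batch_idx}: {dt * 1e3:.1f} ms/it ({1.0 / dt:.1f} it/s)")


# usage: trainer = pl.Trainer(callbacks=[BatchTimeLogger()])
```

Because CUDA execution is asynchronous, this measures host-side time only, but that is exactly where the stall described further below shows up.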
What is the expected behavior?
I would expect the training speed to be relatively stable throughout the entire first epoch, perhaps with a very brief initial warm-up period of a few batches for things like kernel compilation, but not a sustained slowdown that lasts for a significant portion of the epoch.
Investigation and what I’ve ruled out
I have already investigated several common causes for this kind of slowdown:
I/O and Page Caching: My initial hypothesis was that the initial slow speed was due to uncached data reads, and the later speed-up was due to the dataset fitting into the system’s page cache.
- Test: I isolated my custom Dataset and DataLoader from the Lightning training loop, manually iterated through the DataLoader for a full epoch, and measured the time of each next() call on the DataLoader iterator (a sketch of this timing loop is included after this list).
- Result: The data fetching time was stable and consistently fast, with no significant slowdown at the beginning. This suggests that the bottleneck is not in the data loading pipeline itself.
CUDA Kernel Compilation/JIT: I considered one-time costs like CUDA kernel compilation.
- Reasoning: While some initial overhead is expected, this warm-up period seems far too long, persisting for hundreds of minibatches. A typical kernel compilation warm-up should be much shorter.
Environment or Model Specificity: To ensure this wasn’t an issue with a particular setup, I have:
- Tried different, simpler models.
- Run the training on different machines.
- Result: The phenomenon (training speed gradually increasing over the first epoch) is reproducible in all these scenarios.
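For completeness, the DataLoader isolation test mentioned above was along these lines. This is a minimal sketch with a stand-in dataset; substitute the real Dataset and DataLoader arguments.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset and loader arguments; substitute the real ones.
dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))
dataloader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)

it = iter(dataloader)  # a single iterator for the whole epoch
fetch_times = []
while True:
    t0 = time.perf_counter()
    try:
        batch = next(it)  # time each fetch; no model or GPU work involved
    except StopIteration:
        break
    fetch_times.append(time.perf_counter() - t0)

n = min(100, len(fetch_times))
print(f"first {n} fetches: {sum(fetch_times[:n]) / n * 1e3:.2f} ms avg")
print(f"last {n} fetches:  {sum(fetch_times[-n:]) / n * 1e3:.2f} ms avg")
```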
I’ve conducted a further investigation into this behavior, and I can confirm that the issue appears to be a significant stall on the CPU execution stream immediately after launching a GPU kernel.
I used the PyTorch Profiler to perform a detailed performance analysis of an entire training epoch. To illustrate the problem, let’s look at the model’s forward pass at different stages.
Here is the profiler trace (flame graph) for the forward pass during the 30th minibatch:
Here is the trace for the same forward pass during the 240th minibatch:
And finally, here is the trace for the forward pass during the 600th minibatch:
Please note the wall duration of the forward function (common/model/models.py(72): forward) in these three traces: 76.2 ms, 28.6 ms, and 0.403 ms, respectively.
More specifically, if we examine the GPU kernels launched by PyTorch functions (e.g., linear, layer_norm, gelu, annotated in the first image), we can observe a strange pause on the CPU after each kernel launch: the CPU does not immediately proceed to launch the next kernel. This is not limited to the forward pass; it occurs throughout the entire training_step. Any operation that launches a device kernel from the host results in a stall in the host’s instruction stream, and this is what makes overall training exceptionally slow at the beginning of the epoch. In the latter half of the epoch, the stalling behavior disappears.
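For context, traces like the ones above can be captured with the standard torch.profiler API, roughly as follows. This is a sketch with a stand-in model and training loop; the schedule, output directory, and step counts are illustrative, not my exact setup.

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

# Stand-in model/optimizer; substitute the real LightningModule and data.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(
    nn.Linear(128, 256), nn.GELU(), nn.LayerNorm(256), nn.Linear(256, 10)
).to(device)
opt = torch.optim.AdamW(model.parameters())

prof = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    # Record a few consecutive steps per cycle so early and late minibatches
    # (e.g. around ~30, ~240, ~600) can be compared within one run.
    schedule=schedule(wait=25, warmup=2, active=3, repeat=0),
    on_trace_ready=tensorboard_trace_handler("./profiler_traces"),
    record_shapes=True,
    with_stack=True,
)

with prof:
    for step in range(700):
        x = torch.randn(64, 128, device=device)
        y = torch.randint(0, 10, (64,), device=device)
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
        prof.step()  # advance the profiler schedule once per minibatch
```

The exported traces can then be opened in TensorBoard’s profiler plugin or in a Chrome-trace viewer for the kind of flame-graph view shown above.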
I find this difficult to understand. I have verified that the GPU kernels being launched are all fast operations, with execution times far shorter than the duration of the CPU stalls. Furthermore, I am not manually invoking any GPU-CPU synchronization operations (like torch.cuda.synchronize()) in my model.
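For reference, one way to detect implicit synchronization points is PyTorch’s CUDA sync debug mode, which warns (or raises) whenever an operation implicitly synchronizes host and device, such as nonzero() or a blocking device-to-host copy. A minimal sketch; the placement around a few of the slow early minibatches is purely illustrative:

```python
import torch

# Warn on operations that implicitly synchronize CPU and GPU
# (use "error" to raise instead, "default" to restore normal behavior).
torch.cuda.set_sync_debug_mode("warn")

# ... run a handful of the slow early minibatches here ...

torch.cuda.set_sync_debug_mode("default")
```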
Are there any further avenues I could investigate, or does anyone have suggestions as to what might be causing this? Any help would be greatly appreciated.
Environment
- PyTorch Version: 2.9.0
- PyTorch Lightning Version: 2.5.5
- Python Version: 3.12
- OS: Ubuntu 24.04
- CUDA/cuDNN Version: 12.8
- GPU models and configuration: likely irrelevant (the behavior reproduces across different machines)