Unusual CPU Stalls and Significantly Unstable Training Speed During the First Epoch

Hi team,

I’m encountering a reproducible phenomenon where the training speed during the first epoch is unstable in a peculiar way. Training starts very slow for roughly the first ~30% of minibatches and then accelerates significantly, becoming much faster after roughly the 50% mark. This behavior is consistent across runs. I’m using PyTorch Lightning to simplify training management.

What is the current behavior?

During the trainer.fit() call, the progress bar for the first training epoch shows a slow iteration speed (e.g., 2.5 it/s) for the initial few hundred batches. After this “warm-up” period, the speed dramatically increases (e.g., 25 it/s) and remains high for the rest of the epoch and subsequent validation stages.

What is the expected behavior?

I would expect the training speed to be relatively stable throughout the entire first epoch, perhaps with a very brief initial warm-up period of a few batches for things like kernel compilation, but not a sustained slowdown that lasts for a significant portion of the epoch.

Investigation and what I’ve ruled out

I have already investigated several common causes for this kind of slowdown:

I/O and Page Caching: My initial hypothesis was that the initial slow speed was due to uncached data reads, and the later speed-up was due to the dataset fitting into the system’s page cache.

  • Test: I isolated my custom Dataset and DataLoader from the Lightning training loop. I manually iterated through the DataLoader for a full epoch and measured the time of each next() call on the DataLoader iterator (see the sketch after this list).

  • Result: The data fetching time was stable and consistently fast, with no significant slowdown at the beginning. This suggests that the bottleneck is not in the data loading pipeline itself.
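The isolation test looked roughly like the following (a minimal sketch; `MyDataset` and the DataLoader settings are placeholders for my actual setup):

```python
import time
from torch.utils.data import DataLoader

# MyDataset stands in for my actual custom Dataset; the loader settings are illustrative.
dataloader = DataLoader(MyDataset(), batch_size=32, num_workers=4, pin_memory=True)

it = iter(dataloader)
fetch_times = []
while True:
    t0 = time.perf_counter()
    try:
        batch = next(it)  # time only the data-fetching step, with no training loop involved
    except StopIteration:
        break
    fetch_times.append(time.perf_counter() - t0)

print(f"mean fetch time, first 100 batches: {sum(fetch_times[:100]) / 100:.4f}s")
print(f"mean fetch time, last 100 batches:  {sum(fetch_times[-100:]) / 100:.4f}s")
```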

CUDA Kernel Compilation/JIT: I considered one-time costs like CUDA kernel compilation.

  • Reasoning: While some initial overhead is expected, this warm-up period seems far too long, persisting for hundreds of minibatches. A typical kernel compilation warm-up should be much shorter.

Environment or Model Specificity: To ensure this wasn’t an issue with a particular setup, I have:

  • Tried different, simpler models.

  • Run the training on different machines.

  • Result: The phenomenon (training gradually speeding up over the course of the first epoch) is reproducible in all these scenarios.

I’ve conducted a further investigation into this behavior, and I can confirm that the issue appears to be a significant stall on the CPU execution stream immediately after launching a GPU kernel.

I used the PyTorch Profiler to perform a detailed performance analysis of an entire training epoch. To illustrate the problem, let’s look at the model’s forward pass at different stages.
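For reference, the traces were collected with a setup roughly like the following (an illustrative sketch: the schedule numbers are examples, and the bare loop with `model`, `dataloader`, and `optimizer` stands in for Lightning’s trainer.fit / my actual training_step):

```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity, tensorboard_trace_handler

# Capture a few steps around the minibatches of interest; the wait/warmup/active
# numbers here are illustrative, not the exact schedule I used.
prof_schedule = schedule(wait=25, warmup=2, active=3, repeat=1)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=prof_schedule,
    on_trace_ready=tensorboard_trace_handler("./profiler_traces"),
    with_stack=True,  # needed to see frames like common/model/models.py(72): forward
) as prof:
    for step, batch in enumerate(dataloader):
        loss = model(batch).mean()  # placeholder for the real training_step
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()  # advance the profiler schedule once per minibatch
```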

Here is the profiler trace (flame graph) for the forward pass during the 30th minibatch:

[profiler trace screenshot: forward pass, 30th minibatch]

Here is the trace for the same forward pass during the 240th minibatch:

[profiler trace screenshot: forward pass, 240th minibatch]

And finally, here is the trace for the forward pass during the 600th minibatch:

[profiler trace screenshot: forward pass, 600th minibatch]

Please note the wall duration of the forward function (common/model/models.py(72): forward) in these three traces: 76.2 ms, 28.6 ms, and 0.403 ms, respectively.

More specifically, if we examine the GPU kernels launched by PyTorch functions (e.g., linear, layer_norm, gelu, annotated in the first image), we can observe a strange pause on the CPU after each kernel launch. The CPU does not immediately proceed to launch the next kernel. This phenomenon is not limited to the forward pass; it occurs throughout the entire training_step: any operation that launches a device kernel from the host results in a stall of the host’s instruction stream. This is what makes overall training exceptionally slow at the beginning of the epoch. In the latter half of the epoch, the stalling behavior disappears.

I find this difficult to understand. I have verified that the GPU kernels being launched are all fast operations, with execution times far shorter than the duration of the CPU stalls. Furthermore, I am not manually invoking any GPU-CPU synchronization operations (like torch.cuda.synchronize()) in my model.
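As an aside, implicit synchronizations (e.g., from .item(), .cpu(), or printing a CUDA tensor) can also stall the host; PyTorch’s sync debug mode can surface them. A minimal sketch of that check (not part of my model code, just the verification):

```python
import torch

# Warn whenever an op forces an implicit GPU->CPU synchronization,
# e.g. .item(), a blocking .cpu() copy, or nonzero() on a CUDA tensor.
torch.cuda.set_sync_debug_mode("warn")  # options: "default", "warn", "error"

x = torch.randn(1024, device="cuda")
val = x.sum().item()  # emits a warning here, since .item() synchronizes with the GPU
```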

Are there any further avenues I could investigate, or does anyone have suggestions as to what might be causing this? Any help would be greatly appreciated.

Environment


• PyTorch version: 2.9.0

• PyTorch Lightning version: 2.5.5

• Python version: 3.12

• OS: Ubuntu 24.04

• CUDA/cuDNN version: 12.8

• GPU models and configuration: probably not relevant, since the behavior reproduces on different machines

I also found this issue, which might be related: CPU thread slow to enqueue GPU and communication kernels, but I’m not sure.

Hello, my friend. I’ve encountered similar issues before. Let me start by saying that the essence of this phenomenon lies in PyTorch’s dynamic-graph (eager) execution model. For dynamic-graph training to be efficient, GPU kernel execution must be able to run fully asynchronously, without being blocked by the host’s kernel launches. The ideal scenario is for the CPU side to run through the training code quickly, enqueue every kernel launch, and then simply wait for the GPU to finish.

The biggest factor affecting how fast the CPU side can run is Python’s GIL (Global Interpreter Lock). If another thread holds the GIL at that moment, the thread responsible for launching kernels can only wait. You can use Nsight Systems (nsys) with Python GIL tracing enabled to see which thread is holding the GIL, and then adjust the relevant code. A simple example is to move data processing into a separate process so it does not compete with the training loop for the GIL.
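For example, the standard DataLoader already does this when num_workers > 0; a minimal sketch (the dataset and the exact settings are placeholders):

```python
from torch.utils.data import DataLoader

# With num_workers > 0, dataset __getitem__ / preprocessing runs in separate
# worker *processes*, so heavy Python-side transforms never contend for the
# GIL of the process that is busy launching CUDA kernels.
loader = DataLoader(
    dataset,                  # placeholder for your Dataset
    batch_size=32,
    num_workers=8,            # separate processes, not threads
    persistent_workers=True,  # keep workers alive across epochs
    pin_memory=True,          # allows asynchronous host-to-device copies
    prefetch_factor=4,        # batches prefetched per worker
)
```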

Another solution is to move PyTorch training from dynamic-graph (eager) execution toward static-graph execution. You can use torch.compile with the reduce-overhead mode to leverage CUDA Graphs as much as possible: kernels get fused where possible, and whole sequences of kernel launches are replayed with a single graph launch. The purpose is to reduce how often the CPU has to launch kernels. That way, even if the CPU side stalls occasionally, the GPU still has queued work and your actual training speed is not affected.
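A minimal sketch (model is whatever nn.Module you train; the flag is spelled mode="reduce-overhead" in current PyTorch):

```python
import torch

# mode="reduce-overhead" lets the compiler capture regions of the model into
# CUDA Graphs, so a whole sequence of kernel launches is replayed with a
# single graph launch instead of one Python-driven launch per kernel.
compiled_model = torch.compile(model, mode="reduce-overhead")
```

In Lightning, compiling the LightningModule before calling trainer.fit generally works the same way; the training loop itself stays unchanged.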