Understanding bottlenecks in training loop speed

I’m trying to understand something about loops in PyTorch. I timed this function (modified from docs):

    for batch, (X, y) in enumerate(dataloader):
        print("Shape: ", X.shape)
        X, y = X.to(device), y.to(device)
        for i in range(10000):
            # Compute prediction error
            pred = model(X)
            loss = loss_fn(pred, y)

            # Backpropagation
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

This function will run training on a single batch 10,000 times (I understand this is pointless as it would just memorize that specific batch).

The reason I wrote it is to understand the following timings:

  • The network is really small: [784 in] → [512 dense] → ReLU → [10 out].
  • I ran this on an RTX 4090: a 24 GB card with 16,384 CUDA cores and a ~2.3 GHz clock.
  • The loop takes about 3.6 seconds to complete.

From what I understand about deep learning on GPUs, 3.6 seconds seems way too high.

If the GPU has plenty of memory to hold the entire model (along with the batch and gradient state), 10,000 iterations of a three-layer network should complete in well under a few hundred thousand clock cycles. (This could also be a complete misunderstanding of how this works; please feel free to correct me if it's wrong.)

My best theory for why:
The for loop runs on the CPU, so on each iteration the GPU has to report back to the CPU that the iteration is complete before the loop continues. But even that communication time should be marginal and shouldn't cause such a slow training time, should it?

I can’t find any answers on a) whether it should be faster, and b) why it isn’t.

I’d appreciate any insights. Thanks.

CUDA operations are executed asynchronously, and since your code is missing synchronizations it will measure the dispatch and kernel-launch overheads at best.
You are right that the CPU is used to schedule the work and especially if your actual GPU workload is tiny these overheads might be visible as “whitespaces” in a visual profile created by the native PyTorch profiler or e.g. Nsight Systems.
In these cases your workload is CPU-limited and the GPU is “starving”.
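To see this, add explicit synchronizations around the timed region: the Python call returns as soon as the work is queued, so without them you only measure how fast the CPU can launch kernels. A minimal sketch (the `timed` helper and iteration count are my own; the CPU guard is just so the snippet runs anywhere):

```python
import time
import torch

def timed(fn, iters=1000):
    # CUDA kernels are launched asynchronously, so synchronize before
    # starting the clock and again before stopping it; otherwise the
    # timer stops while work is still queued on the GPU.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.perf_counter() - start

# Example: time a small matmul (runs on CPU if no GPU is present)
a = torch.ones(64, 64)
elapsed = timed(lambda: torch.mm(a, a), iters=10)
```

Comparing the synchronized number against the naive one makes the hidden launch overhead visible directly.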
If these tiny models represent real workloads, you could check CUDA Graphs, which would eliminate the kernel launch overheads as described here.
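As a rough sketch of the CUDA Graphs idea: warm up on a side stream, capture one forward/backward/step into a graph, and then replay the whole iteration with a single launch. The helper name is mine, it assumes static input shapes and a capture-safe optimizer (e.g. `Adam(..., capturable=True)`), and it returns `None` on machines without CUDA:

```python
import torch

def make_graphed_step(model, loss_fn, optimizer, static_x, static_y, warmup=3):
    # Capture one full training step into a CUDA graph so replaying it
    # costs a single launch instead of one launch per kernel.
    if not torch.cuda.is_available():
        return None
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):          # warm up on a side stream
        for _ in range(warmup):
            optimizer.zero_grad(set_to_none=True)
            loss_fn(model(static_x), static_y).backward()
            optimizer.step()
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.graph(g):           # record one fwd/bwd/step
        loss = loss_fn(model(static_x), static_y)
        loss.backward()
        optimizer.step()
    return g.replay                     # call this to rerun the step

# To train, copy new data into static_x/static_y and call the returned replay.
```

Since the graph replays fixed memory addresses, new batches must be copied into `static_x`/`static_y` in place before each replay.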


Try increasing your batch size instead of running 10,000 iterations.

You can also use half precision or mixed precision for an additional speedup, which also frees memory for larger batches.
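A mixed-precision training step might look like the sketch below (the batch size, model sizes, and learning rate are illustrative; on a 4090 you would run with `device = "cuda"` and get tensor-core speedups, while the CPU fallback just keeps the snippet runnable):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Same shape as the network in the question: 784 -> 512 -> ReLU -> 10
model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(784, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
).to(device)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
# Gradient scaling is only needed for fp16 on CUDA
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

X = torch.randn(256, 1, 28, 28, device=device)   # larger batch than 64
y = torch.randint(0, 10, (256,), device=device)

optimizer.zero_grad(set_to_none=True)
with torch.amp.autocast(device_type=device):     # lower-precision fwd pass
    loss = loss_fn(model(X), y)
scaler.scale(loss).backward()                    # scaled loss avoids underflow
scaler.step(optimizer)
scaler.update()
```

Increasing the batch size attacks the same problem from the other side: each kernel launch then does enough work that the fixed per-launch overhead stops dominating.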