Model forward suddenly slows down

Hi, I'm running the following code to test my model on a single GPU:

import time
import torch

model.eval()
for i, x in enumerate(test_inputs):
    t0 = time.time()
    with torch.no_grad():
        y = model(x)
    print(f'forward time: {time.time() - t0:.4f}s')

The model is a simple MLP with 11 layers and 1024 neurons per layer. Each input x is a tensor of shape [50, 1024].
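In case it helps, here is a rough sketch of the setup (the layer definitions, activation, and number of test inputs are only illustrative, since the real code is private):

import torch
import torch.nn as nn

# Rough approximation of the model: 11 Linear layers with 1024 neurons each
layers = []
for _ in range(10):
    layers += [nn.Linear(1024, 1024), nn.ReLU()]
layers.append(nn.Linear(1024, 1024))
model = nn.Sequential(*layers).cuda()

# test_inputs: each x is of shape [50, 1024]; 100 iterations is arbitrary here
test_inputs = torch.randn(100, 50, 1024, device='cuda')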

My problem is: for the first 23 iterations, the forward time is about 0.0015s. After iteration #23, the forward time suddenly jumps to ~0.28s and stays at that level afterwards. I am wondering why the forward pass suddenly slows down at some point while the earlier iterations run so fast. I suspect this is related to the PyTorch CUDA cache. What would you suggest to resolve this problem? Thanks!

My environment:
OS: Ubuntu 20.04
CUDA: 11.2
GPU: Tesla V100 16GB
Python: 3.9.5
PyTorch: 1.9.0

Hi @MingSun-Tse

Is it possible that test_inputs is a dataloader and loading the last inputs is just slower than the first ones?

Hi @zetyquickly, thanks for your reply! test_inputs is not a dataloader. It is simply a tensor. For each forward pass, x has exactly the same shape.

Could you run the same test on Colab and see whether the issue appears there? Btw, please send the link to that Colab notebook.

Apart from software-related problems (PyTorch, Python, etc.), you may also suspect hardware issues.

For instance, below is an observation I came across; I am not sure if this is your case too.
If you watch the GPU temperature in nvidia-smi and it reaches >= 90°C, the GPU might enforce software/hardware throttling that slows it down.
In my case, the temperature was low for the first few loops, so they were fast. However, as the temperature rose during training, throttling came into effect.
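If you want to log the temperature right inside your timing loop, something like this (just a sketch that shells out to nvidia-smi) works:

import subprocess

def gpu_temperature():
    # Ask nvidia-smi for the current GPU temperature in °C
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=temperature.gpu', '--format=csv,noheader'])
    return int(out.decode().strip().splitlines()[0])

print(f'GPU temperature: {gpu_temperature()}°C')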

CUDA operations are executed asynchronously, so to get valid timings you would have to manually synchronize the code before starting and stopping the timer via torch.cuda.synchronize().
In your current code you might only be profiling the kernel launches, and the first blocking operation would then accumulate the timings of the other kernels.
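Your loop with explicit synchronization would look roughly like this (a sketch based on your snippet):

model.eval()
for i, x in enumerate(test_inputs):
    torch.cuda.synchronize()  # make sure all previously launched kernels are done
    t0 = time.time()
    with torch.no_grad():
        y = model(x)
    torch.cuda.synchronize()  # wait for this forward pass to finish on the GPU
    print(f'forward time: {time.time() - t0:.4f}s')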


Hi @zetyquickly, thank you! Sorry, it is a private project, so I cannot share the exact testing code on Colab.

Hi @InnovArul, thank you! I checked the temperature. The idle GPU temperature is 34°C and during inference it is 53°C, so it is not a temperature issue.

@ptrblck, thank you! I tried adding torch.cuda.synchronize() at the end of each for loop. Indeed, it works! The time of each model forward is now consistently ~0.0015s. The only problem is that torch.cuda.synchronize() itself takes ~0.27s, so the complete test takes nearly the same total time as before. I guess this points to where the problem is. I also note this issue was reported before. Now I am wondering if this can be resolved somehow. Any clue?

Thank you ALL!

It’s not an “issue”, but just how asynchronous execution works.
The wrong timings are created because the CPU can launch the CUDA kernels and continue executing the host code until a synchronization is needed. In the meantime the GPU is working in the background.
If you thus start the timer, launch the kernel, and directly stop the timer afterwards, the GPU might still be busy and you would not profile the actual GPU workload.
As already described, you need to synchronize the code to be able to profile the kernel timings on the GPU, or, even better, use torch.utils.benchmark, which was created for use cases like these.
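A minimal example (the number of measurement runs is up to you) could look like this:

import torch.utils.benchmark as benchmark

x = test_inputs[0]
timer = benchmark.Timer(
    stmt='model(x)',
    globals={'model': model, 'x': x})

# blocked_autorange() takes care of warmup and CUDA synchronization internally
print(timer.blocked_autorange())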


Hi @ptrblck, I see. Great explanation! Thanks!