Strange performance behaviour

Hello everyone,
I’m working on some code and I’m seeing some very strange behaviour when measuring the performance of my model. I’m building a predictive coding framework, which means I’m dealing with many independent layers that each perform their own forward and backward pass. In both training and evaluation mode I therefore need to iterate through this single-layer forward/backward process hundreds of times to obtain my results. The training pass is more expensive, however, because I’m also updating the weights of the model, while in evaluation I’m only iterating until convergence. (It’s not important if you are not familiar with predictive coding; my question is more generally about PyTorch and CUDA.)
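Roughly, the structure looks like the following sketch (heavily simplified; the layer sizes, names and update rule are placeholders, not my actual code), where each iteration runs a forward and a backward pass per layer and only the latent states are updated:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder layer stack and latent states, just to show the shape of the loop.
layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(4)]).to(device)
states = [torch.zeros(32, 64, device=device, requires_grad=True) for _ in range(5)]

def relax(layers, states, n_steps=100, lr=0.1):
    """Iterate the per-layer forward/backward passes until the states settle."""
    for _ in range(n_steps):
        for i, layer in enumerate(layers):
            pred = layer(states[i])                    # forward pass of a single layer
            err = ((states[i + 1] - pred) ** 2).sum()  # local prediction error
            err.backward()                             # backward pass of a single layer
        with torch.no_grad():
            for s in states:
                s -= lr * s.grad                       # update the latent states only
                s.grad = None
    return states

relax(layers, states)
```

In training mode the weights of the layers are updated on top of this; in evaluation mode only the states are relaxed until convergence.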

I’m experimenting on 3 different machines:

  • a server with a Xeon @ 2.40 GHz and a Titan Xp
  • a server with a Xeon Gold @ 2.30 GHz and a Titan RTX
  • a laptop with an i7-8750H and an RTX 2070 Max-Q

and I’m observing the following behaviour (the two servers behave similarly, so I’ll treat them as one; I mention both only to show that the strange behaviour is unlikely to come from a hardware/software fault, since it’s improbable that both machines share the same problem).
My laptop is, as expected, slower in training mode (roughly 3000 ms per epoch versus around 1000 ms on the servers), but it is faster in evaluation mode (roughly 250 ms versus 350 ms on the servers). My conclusion was that my code is somehow CPU-bound, since in theory the single-core speed of my laptop should be higher. Furthermore, in training mode my GPU utilization is 100%, while in evaluation it is much lower (50% on the laptop and 20% on the servers). I have no idea what is causing the bottleneck: whether it is actually the CPU, or whether I’m somehow not fully utilizing the GPU (for example, in evaluation mode there are parameters that I’m not training even though I’m using them for prediction, so I’m “reserving” memory for them without actually using it).
Anyway, the strange thing is that, seemingly at random, both servers go into a “turbo mode” and run much faster: roughly 500 ms per epoch in training and 175 ms in evaluation. These bursts last from 2 to around 20 seconds and appear at random (although they seem more frequent in evaluation mode than in training mode). This leaves me speechless: if the process were CPU-bound, how could it suddenly become so much faster (even faster than my laptop, which has a faster single-core CPU)? And if the GPU utilization was already at 100% (according to nvidia-smi), how could it suddenly become faster?
So my only guess is that CUDA/PyTorch is doing some sketchy optimization under the hood, and since it basically doubles the performance I’d really like to understand what it is.

Additional information:

  • My laptop doesn’t show these “boost phases”, but both servers do.

  • I checked, and it’s not related to other people using the servers.

  • My training loop is rather simple: forward pass, loss function, backward pass, optimizer step, scheduler step (a simplified sketch follows at the end of this list).

  • I’m sending the loss to the CPU with non_blocking=True and only using it in the scheduler on the following iteration, to avoid synchronization points. I’m fairly sure I don’t have any, but is there a way/tool to check? (Also, my GPU usage is 100%, so I guess no sync is happening.)

  • Feel free to ask for more info. Unfortunately I can’t post the code, as it’s quite long and not publicly available yet, but I’ll do my best to provide as much information as possible.
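For reference, a single training iteration looks roughly like this (a simplified sketch with toy stand-ins so it runs on its own; the real model, optimizer, scheduler and data are different, but the structure is the same):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-ins so the sketch is self-contained; the real components differ.
model = nn.Linear(64, 10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
loader = [(torch.randn(32, 64), torch.randint(0, 10, (32,))) for _ in range(10)]

prev_loss = None
for inputs, targets in loader:
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)

    optimizer.zero_grad(set_to_none=True)
    loss = criterion(model(inputs), targets)   # forward pass + loss function
    loss.backward()                            # backward pass
    optimizer.step()                           # optimizer step

    # The scheduler is fed the loss of the *previous* iteration, so the
    # device-to-host copy started with non_blocking=True has time to finish
    # without an explicit synchronization point.
    if prev_loss is not None:
        scheduler.step(prev_loss.item())       # scheduler step
    prev_loss = loss.detach().to("cpu", non_blocking=True)
```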

Thank you to everyone!

I would recommend profiling the code with the PyTorch profiler or e.g. Nsight Systems to check whether your script is CPU-bound and where the described “boosts” come from.
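Something like this minimal setup would already give a first overview (train_step here is just a stand-in for one iteration of your loop):

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(64, 10).to(device)

def train_step():
    # Stand-in for one iteration of the real training loop.
    x = torch.randn(32, 64, device=device)
    model(x).sum().backward()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    for _ in range(20):
        train_step()

# Large CPU times with comparatively small CUDA times (or gaps between kernels
# in the timeline) would point to a CPU-side bottleneck.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
prof.export_chrome_trace("trace.json")  # can be opened in chrome://tracing or Perfetto
```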

I already tried the PyTorch profiler, but of course I couldn’t manage to catch one of the boosts while it was running. Do you think it would be different with the other tool? I don’t have admin rights on the server, and installing any new tool is generally a long procedure.