Weird variation of computation times across platforms/machines

Hi. I am benchmarking a TorchScript (JIT) model that we want to integrate into our production pipeline. The model performs acceptably on the different platforms we use for development and testing, but it performs badly on the target machine (AWS g4dn.xlarge, whose GPU is a Tesla T4).

I am testing the same model file in 3 configurations:

  • C++ standalone: a dummy C++ program that loads the model and runs it on random inputs
  • C++ integrated: the model running inside the complete application
  • TorchScript: a dummy execution in Python, using torch.jit.load

What I do not understand is that this bad performance happens only in the integrated version on the EC2 instance.
The C++ versions are compiled locally on my PC and copied to the EC2 instance.
Times are consistent across executions. Times are averaged over 100 iterations, excluding the first 2, where optimization happens. And yes, I stop the timer only after moving the result to the host with .to(torch::kCPU), so this is not a CUDA synchronisation problem.
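
To make the timing procedure concrete, here is a minimal Python sketch of it. The helper name `benchmark` and the callable `run_once` are hypothetical stand-ins, not the actual application code, and the sketch assumes 100 timed iterations after 2 discarded warm-up iterations (the numbers above could also be read as 98 timed iterations). For a GPU model, `run_once` is assumed to end with the copy of the result to the CPU, so that the call blocks until the work is really finished.

```python
import time
from statistics import mean


def benchmark(run_once, iterations=100, warmup=2):
    """Return the mean wall-clock time per call in seconds.

    The first `warmup` calls are executed but not included in the mean,
    since the JIT optimization passes run during them.
    """
    timings = []
    for i in range(warmup + iterations):
        start = time.perf_counter()
        # Assumption: run_once() blocks until the result is on the host,
        # e.g. it ends with a device-to-CPU copy of the model output.
        run_once()
        elapsed = time.perf_counter() - start
        if i >= warmup:
            timings.append(elapsed)
    return mean(timings)


if __name__ == "__main__":
    # Stand-in workload; in practice this would be the model's forward call.
    avg = benchmark(lambda: sum(range(10_000)))
    print(f"mean time per iteration: {avg * 1e3:.3f} ms")
```

Timing each iteration separately, rather than timing the whole loop once, also makes it possible to look at the spread of the measurements and not only their mean.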
I am using libtorch 2.0.0.
Any ideas as to where this might come from? How can I debug this?