No speedup from torch.compile()

Hey, just wondering why I’m getting almost no speedup from torch.compile in this toy example

With T4 GPU assigned.

Am I doing something wrong? I know it is a small example but should there not be at least some speedup?


You might want to read through this section of the docs. CC @marksaroufim

Unfortunately, T4 chips won’t benefit as much from torch.compile. torch.compile primarily helps with memory-bandwidth-bound workloads, that is, workloads where the GPU computes so fast that data can’t be moved into it quickly enough to keep it busy. This is less of a problem on older GPUs, since their compute is slower. In addition, newer GPUs introduced tensor cores, which are critical for performance nowadays.
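To make "memory bandwidth bound" a bit more concrete, here is a rough back-of-the-envelope sketch. It compares the bytes moved by three separate elementwise kernels against a single fused kernel, since fusing such ops is one of the ways a compiler can cut memory traffic. The helper names and the peak FLOP/s and bandwidth figures are made up for illustration; they are not the specs of a T4 or any other real GPU.

```python
# Rough estimate: is a kernel limited by memory bandwidth or by compute?
# The hardware numbers used below are illustrative placeholders, not real specs.

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte read from or written to GPU memory."""
    return flops / bytes_moved


def is_memory_bound(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s) -> bool:
    """True if moving the data takes longer than the arithmetic at peak rates."""
    time_memory = bytes_moved / peak_bytes_per_s
    time_compute = flops / peak_flops_per_s
    return time_memory > time_compute


n = 1_000_000       # number of float32 elements
bytes_per_elem = 4  # float32

# y = relu(x * 2 + 1) as three separate kernels: each reads n and writes n elements.
unfused_bytes = 3 * (n * bytes_per_elem + n * bytes_per_elem)
# The same computation fused into one kernel: read x once, write y once.
fused_bytes = n * bytes_per_elem + n * bytes_per_elem
# Count one operation per element for each of the three steps.
flops = 3 * n

print(arithmetic_intensity(flops, unfused_bytes))  # 0.125 FLOP/byte
print(arithmetic_intensity(flops, fused_bytes))    # 0.375 FLOP/byte

# Hypothetical device: 10 TFLOP/s peak compute, 300 GB/s peak memory bandwidth.
print(is_memory_bound(flops, fused_bytes, 10e12, 300e9))  # True
```

Even the fused version is far below what this hypothetical device would need to be compute bound, so for this kind of op the runtime is dominated by memory traffic and cutting the bytes moved is where the gain comes from.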

So my suggestion is to find a notebook provider where you can pick your GPU, and get either an A10G or an A100.
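Whichever GPU you end up on, it is worth checking that the timing itself is fair: the first calls to a compiled model include compilation time, and CUDA kernels run asynchronously, so you need warmup calls and a synchronize before reading the clock. Below is a minimal sketch of such a comparison. The `benchmark` helper, the small MLP, and the tensor sizes are my own placeholders, not the code from the original question.

```python
import time


def benchmark(fn, warmup=3, iters=20, sync=None):
    """Return mean seconds per call of fn(), excluding the warmup calls."""
    for _ in range(warmup):
        fn()  # warmup: triggers compilation and any lazy initialization
    if sync is not None:
        sync()  # wait for pending GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for the timed GPU work to actually finish
    return (time.perf_counter() - start) / iters


try:
    import torch
except ImportError:  # keeps the helper usable even without PyTorch installed
    torch = None

if torch is not None and torch.cuda.is_available():
    # Placeholder model and input; substitute your own toy example here.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024),
        torch.nn.ReLU(),
        torch.nn.Linear(1024, 1024),
    ).cuda()
    x = torch.randn(256, 1024, device="cuda")
    compiled = torch.compile(model)

    with torch.no_grad():
        eager_s = benchmark(lambda: model(x), sync=torch.cuda.synchronize)
        compiled_s = benchmark(lambda: compiled(x), sync=torch.cuda.synchronize)

    print(f"eager: {eager_s * 1e3:.3f} ms, compiled: {compiled_s * 1e3:.3f} ms")
```

If the two numbers are still close after warmup and synchronization, the lack of speedup is more likely down to the workload and the hardware than to the measurement.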