Torch.compile() negative performance

If I run the example, as is, from:

I get

(eval) eager median: 0.0010229760408401488, compile median: 0.0010844160318374634, speedup: 0.9416660956057048x

or about 6% slower. The tutorial author says he sees a 2.3x speedup on his A100. I have a 5.8 GHz i9-13900K and a 4090 on Ubuntu.
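For anyone reproducing numbers like these, here is a minimal sketch of a median-based timing harness and the speedup ratio (eager time divided by compiled time). It is not the tutorial's code: `median_seconds` and `speedup` are hypothetical helper names, and it times with `time.perf_counter` plus an optional `sync` callback rather than whatever timer the tutorial uses.

```python
import statistics
import time


def median_seconds(fn, iters=10, warmup=3, sync=None):
    """Return the median wall-clock seconds per call of fn().

    sync is an optional zero-argument callable that blocks until
    outstanding device work has finished (an assumption of this sketch;
    without it, asynchronous GPU launches would be under-timed).
    """
    # Warm-up calls are not timed (first calls may include compilation).
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(iters):
        if sync is not None:
            sync()  # make sure earlier work is not billed to this call
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for this call's work to finish
        times.append(time.perf_counter() - start)
    return statistics.median(times)


def speedup(eager_median, compiled_median):
    """Values above 1.0 mean compiled is faster; below 1.0, slower."""
    return eager_median / compiled_median


# The two medians from the output above: the ratio is roughly 0.94,
# i.e. the compiled run takes about 6% longer per iteration.
ratio = speedup(0.0010229760408401488, 0.0010844160318374634)
extra_time_pct = (1.0 / ratio - 1.0) * 100.0
```

With PyTorch on a GPU, this could be called as `median_seconds(lambda: model(x), sync=torch.cuda.synchronize)`, once for the eager model and once for the compiled one.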

As I’m learning this stuff I’ve noticed that CPU performance plays a very significant role in whether recommended optimizations do or do NOT work. If I run this test on my 4.3 GHz E-cores then I see a good 1.82x perf boost. In the work I do with Stable Diffusion I’ve noticed that many people using a 4090 do NOT have the same uber fast single-core CPU (5.8 GHz) that I have, and with a 4090 this becomes a limiting factor for some workloads, in particular the workload commonly used for benchmarking: batchsize=1 512x512 image generation. Confusion reigns supreme on some boards, with people arguing about whether opt-channelslast helps or hurts, and the same for torch.backends.cudnn.benchmark. For me they both help; however, I discovered that if I run at 4.3 GHz then one of them hurts perf and the other makes no difference. It is as if the slower CPU can’t take advantage of software improvements because it has no CPU headroom left, whereas a 5.8 GHz CPU does.
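Since these two settings help on some machines and hurt on others, it is worth toggling them yourself and timing each combination. A minimal sketch of the PyTorch-level switches follows. `apply_toggles` is a hypothetical helper, the small `Conv2d` stands in for a real model, and it is an assumption here that the opt-channelslast option corresponds to converting the model to `torch.channels_last`.

```python
import torch


def apply_toggles(model, channels_last=True, cudnn_benchmark=True):
    """Apply the two settings discussed above and return the model.

    Hypothetical helper for A/B testing; benchmark each combination
    on your own CPU/GPU rather than assuming either setting helps.
    """
    # Lets cuDNN try several convolution algorithms and cache the fastest.
    torch.backends.cudnn.benchmark = cudnn_benchmark
    if channels_last:
        # Store 4D weights in NHWC order instead of the default NCHW.
        model = model.to(memory_format=torch.channels_last)
    return model


# Stand-in model and input, for illustration only.
model = apply_toggles(torch.nn.Conv2d(3, 8, kernel_size=3))
x = torch.randn(1, 3, 64, 64).to(memory_format=torch.channels_last)
y = model(x)
```

Inputs should be converted to the same memory format as the model, as done for `x` above.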

But even my 5.8 GHz CPU runs out of steam when doing torch.compile() on the SD application. I say that because I get a modest 7% perf improvement, but then the GPU is only about 88% busy instead of 98% busy. That seems to imply that even a 5.8 GHz single core can’t push a highly optimized 4090 to the max. However, when using a larger batch size, optimization does get me about a 15% speedup and gets me close to 100% busy on the GPU. While that might not seem like much, you have to realize that before I even tried torch.compile() I was getting uber fast numbers like 43 it/s, which might be the fastest anyone is doing in AUTOMATIC1111 Stable Diffusion.

Back to the tutorial. Given my findings that these other optimizations don’t help on a slower CPU, I’m surprised the opposite is true for the tutorial example. There, compile shows a speedup (82%), but only when the CPU is slower, and this is for a SIMPLE test.

And it gets weirder. Given that this test only pushes the GPU to less than 40% busy, I increased the batch size from 16 to 64 in the test. Then the perf in the 4 combinations of (fast/slow CPU) and (compiled/not compiled) was all about the same. Compile with batchsize=64 is now 2% faster on the fast cores and <1% slower on the slow cores. Doesn’t make much sense.

About the only thing that makes sense, although it is just my theory, is that because the 4090 is a very fast piece of hardware, if you also run highly optimized algorithms on it, even the fastest CPUs might not be able to push it to its fullest in some common workloads. There are ways around that, such as batching or running multiple app instances or threads on the GPU at the same time. Presuming I can share the large-memory-footprint model on the GPU between two threads in the app process, I should be able to push the GPU to 100% busy and get the most out of it. There were 4090 owners out there running at about 1/4 of the perf I got until I told them about the cuDNN 8.7 fix. Those that only got 2x to 3x faster on the 4090 have slower CPUs. If one is going to buy the most expensive consumer card, one needs to know how to get the most out of it, because if used properly it is quite fast.

@aifartist unfortunately we haven’t run too many experiments on a 4090, since not many people on the team have one.

Generally though, it sounds like you might benefit from `torch.compile(model, mode="reduce-overhead")`.
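A minimal sketch of that suggestion is below. The tiny `Sequential` model and the `compile_reduce_overhead` wrapper are placeholders for illustration only; in practice you would pass in your own model.

```python
import torch


def compile_reduce_overhead(model: torch.nn.Module):
    """Wrap a model with torch.compile in "reduce-overhead" mode.

    This mode targets per-call framework overhead, which matters most
    for small batches where the CPU struggles to keep the GPU fed.
    It may use extra memory, so measure before adopting it.
    """
    return torch.compile(model, mode="reduce-overhead")


# Placeholder model for illustration only.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
compiled = compile_reduce_overhead(model)

# Compilation is lazy: it happens on the first call(s) to `compiled`,
# so exclude those warm-up calls when timing.
```

Whether this closes the gap on a fast GPU with a CPU-bound workload is something to verify by timing eager and compiled runs side by side.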

Regardless, you mentioned people being confused on discussion boards about what to do next. Do you mind inviting me? I’d be happy to take a closer look.