Not seeing any training time speedups when using torch.compile

Hello, I’m training a DLRM recommender system model (similar to DLRM: An advanced, open source deep learning recommendation model) with DCNv2 in PyTorch. I’m trying to see whether applying torch.compile to my model before training can provide some training-time speedup. I’ve removed as many graph breaks as I can (5 remain that are not easy to remove), but I’m still seeing essentially no difference in training time with vs. without torch.compile. In fact, the forward() and optimizer.step() times are slightly longer (~6-8%) with torch.compile. Most of my training step time is in the backward pass, and that doesn’t seem to be affected by torch.compile either.

Is this expected? Even with a few graph breaks, I would have expected torch.compile to give some training-time speedup. Also, should I expect torch.compile to have a different effect on training time if I train with 1 vs. multiple (e.g., 8) GPUs? Any insights here would be much appreciated. Thanks!
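For reference, here is a simplified sketch of how I’m applying torch.compile and timing each phase. The tiny MLP, batch shapes, and loss below are just stand-ins for my actual DLRM/DCNv2 setup:

```python
import time
import torch
import torch.nn as nn

# Tiny stand-in for the real DLRM/DCNv2 model -- only meant to show
# the structure of the training loop and how each phase is timed.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1)).cuda()
compiled_model = torch.compile(model)
optimizer = torch.optim.Adam(compiled_model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(20):
    batch = torch.randn(1024, 128, device="cuda")
    labels = torch.randint(0, 2, (1024, 1), device="cuda").float()

    torch.cuda.synchronize()
    t0 = time.perf_counter()
    loss = loss_fn(compiled_model(batch), labels)   # forward
    torch.cuda.synchronize()
    t1 = time.perf_counter()
    loss.backward()                                 # backward
    torch.cuda.synchronize()
    t2 = time.perf_counter()
    optimizer.step()                                # optimizer step
    optimizer.zero_grad()
    torch.cuda.synchronize()
    t3 = time.perf_counter()
    print(f"step {step}: fwd {t1 - t0:.4f}s  bwd {t2 - t1:.4f}s  opt {t3 - t2:.4f}s")
```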

Have you taken a look at the performance section of torch.compile, the missing manual?

A few things I would look into are:

(1) Having ~5 graph breaks might be fine and still give you good perf, or it might not, depending on the context. For example, if you have a graph break in a deeply nested region of code, Dynamo will generate a separate graph at each function boundary. One way to tell is by using tlparse (there is a lot more detail in this section of torch.compile, the missing manual); see also the graph-break sketch after this list.

(2) Try running the PyTorch profiler, and compare profiles between eager and torch.compile: Profiling to understand torch.compile performance — PyTorch 2.5 documentation. A profiling sketch is also shown below.
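For (1), as a complement to tlparse, one quick sanity check is `torch._dynamo.explain`, which reports how many graphs Dynamo produced and why each break happened. A minimal sketch (the toy module with a deliberate print-induced break is just for illustration, not your model):

```python
import torch
import torch._dynamo as dynamo

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(8, 8)

    def forward(self, x):
        x = self.lin(x)
        print("this print forces a graph break")  # deliberate graph break
        return torch.relu(x)

explanation = dynamo.explain(Toy())(torch.randn(4, 8))
print(explanation.graph_count)        # how many separate graphs Dynamo produced
print(explanation.graph_break_count)  # number of graph breaks
for reason in explanation.break_reasons:
    print(reason)                     # why each break happened
```

For (2), here is a rough sketch of collecting a profile you could compare between an eager run and a torch.compile run; the toy model, batch size, and output file name are placeholders for your own setup:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy stand-ins; swap in your real model and batches.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 1)
).cuda()
compiled_model = torch.compile(model)
batch = torch.randn(1024, 128, device="cuda")

def train_step(m):
    loss = m(batch).sum()
    loss.backward()

# Warm up so compilation time isn't included in the profile.
for _ in range(3):
    train_step(compiled_model)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        train_step(compiled_model)

prof.export_chrome_trace("compiled_trace.json")  # open in chrome://tracing or Perfetto
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```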

If you are able to generate any of the above outputs (tlparse output, or output of the profiler), sharing them would also be helpful for diagnosing the slowness!