How should I use torch.compile properly?

Hi

  • I answered the question on Stack Overflow. One thing that’s important to note: on V100 the improvements from torch.compile are alright, but the real speedups come on A100 or A10G. And yes, calling torch.compile on its own doesn’t do anything visible; the actual compilation happens at the time of the first inference call. The name is sorta bad, it should be called torch.jit but that was already taken XD
  • Regarding your question on distributed: today the distributed support is not very fleshed out. There’s a tradeoff. If you compile the DDP module, torch.compile should in principle be able to trace the communication and do more optimizations there, but it doesn’t today, so stuff is likely to break. It’s safer to compile the inner module and then wrap that in DDP, though this will likely evolve in our next releases
  • It’s not a huge deal either way. Inductor would just prefer it if you don’t have too many device copies, but that’s not gonna break it or cripple perf
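
To make the "nothing happens until the first call" point concrete, here is a rough sketch. It assumes PyTorch 2.x and runs on CPU. The `counting_backend` function and the tiny model are made up for illustration; the backend just records when it gets invoked and then runs the captured graph unoptimized, so you would drop the `backend=` argument to get the default Inductor path.

```python
import torch
import torch.nn as nn

compile_calls = []  # one entry per time the backend is asked to compile a graph


def counting_backend(gm: torch.fx.GraphModule, example_inputs):
    # Hypothetical debugging backend (not part of PyTorch): record the call,
    # then return the captured graph's forward so it runs without optimization.
    compile_calls.append(len(example_inputs))
    return gm.forward


# Illustrative toy model; any nn.Module would do.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# Wrapping is cheap: no tracing or compilation happens here.
compiled = torch.compile(model, backend=counting_backend)
calls_after_wrap = len(compile_calls)

x = torch.randn(2, 8)
out = compiled(x)  # first call: the model is traced and handed to the backend
calls_after_first = len(compile_calls)

out_again = compiled(x)  # same input shapes: the cached graph is reused
calls_after_second = len(compile_calls)

print(calls_after_wrap, calls_after_first, calls_after_second)
```

In practice this means the first forward pass is slow and should be treated as warmup when you benchmark. For the DDP case, the "compile the inner module" ordering from the second bullet would look like `DDP(torch.compile(model))`, assuming a process group is already initialized.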