Torch.compile() before or after .cuda()

Hi, should I call torch.compile() before or after moving the model to the GPU (i.e., calling .cuda())? Or does it not matter? For some models I noticed that running torch.compile() before .cuda() actually slows down inference. A similar question: should I call torch.compile() before or after model = torch.nn.parallel.DistributedDataParallel(model) for DDP?

Call .cuda() before torch.compile, and compile before passing the model in to DDP or FSDP.
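A minimal sketch of that ordering (the toy Linear model and shapes here are just illustrative; the DDP wrap is shown as a comment since it needs a initialized process group inside each worker):

```python
import torch

# Hypothetical toy model; the point is the ordering:
#   1) move to device, 2) torch.compile, 3) wrap in DDP (inside each worker).
model = torch.nn.Linear(16, 4)

# Step 1: device placement first (falls back to CPU here so the sketch runs anywhere).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Step 2: compile the already-device-resident model.
model = torch.compile(model)

# Step 3 (in a distributed worker, after torch.distributed.init_process_group):
# model = torch.nn.parallel.DistributedDataParallel(model)

x = torch.randn(8, 16, device=device)
out = model(x)  # the first call triggers compilation
```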

Thanks for the answer! That's aligned with my experimental results. By the way, may I ask what the reason for this ordering is? Just trying to understand the PyTorch 2 fundamentals a bit more. Thanks in advance!

It's best to call .cuda() first because then Inductor doesn't need to reason about device copies, which makes the compiler's job simpler.

It's best to compile before DDP because the communication collectives aren't traced; they will be in the future, though: [RFC] PT2-Friendly Traceable, Functional Collective Communication APIs · Issue #93173 · pytorch/pytorch · GitHub


For people coming to this thread: please keep in mind that the instructions I gave for DDP are out of date.

Please follow the instructions here instead: [Inductor] Run compiled model failed on 2023_08_17 nightly · Issue #107362 · pytorch/pytorch · GitHub