Hi, should I call torch.compile() before or after moving the model to the GPU (i.e., calling .cuda())? Or does it not matter? For some models I noticed that running torch.compile() before .cuda() actually slows down inference. A similar question: should I call torch.compile() before or after model = torch.nn.parallel.DistributedDataParallel(model) for DDP?
Call .cuda() before torch.compile, and compile before passing the model in to DDP or FSDP.
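A minimal sketch of that ordering, using a toy nn.Linear as a stand-in for a real model and assuming the distributed process group has already been initialized before the DDP step:

```python
import torch
import torch.nn as nn

# Toy module standing in for your model (illustrative only)
model = nn.Linear(128, 128)

model = model.cuda()           # 1) move the model to the GPU first
model = torch.compile(model)   # 2) then compile it

# 3) finally wrap it for distributed training
#    (assumes torch.distributed.init_process_group() was called earlier)
model = torch.nn.parallel.DistributedDataParallel(model)
```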
Thanks for the answer! That's aligned with my experimental results. By the way, may I ask what the reason for this ordering is? Just trying to understand the PyTorch 2 fundamentals a bit more. Thanks in advance!
It's best to call .cuda() first because then Inductor doesn't need to reason about device copies; it just makes the compiler's job simpler.
It's best to compile before DDP because the communication collectives aren't traced; they will be in the future, though: [RFC] PT2-Friendly Traceable, Functional Collective Communication APIs · Issue #93173 · pytorch/pytorch · GitHub
For people coming to this thread, please keep in mind that the instructions I gave for DDP are out of date.
Please follow the instructions here instead: [Inductor] Run compiled model failed on 2023_08_17 nightly · Issue #107362 · pytorch/pytorch · GitHub