Hi, should I call torch.compile() before or after moving the model to the GPU (i.e., calling .cuda())? Or does it not matter? For some models I noticed that running torch.compile() before .cuda() actually slows down inference. A similar question: should I call torch.compile() before or after model = torch.nn.parallel.DistributedDataParallel(model) for DDP?
Call .cuda() before torch.compile, and compile before passing the model in to DDP or FSDP.
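A minimal sketch of that ordering, using a toy nn.Linear as a stand-in for a real model and assuming the distributed process group has already been initialized before the DDP step:

```python
import torch
import torch.nn as nn

# Toy module standing in for your model (illustrative only)
model = nn.Linear(128, 128)

model = model.cuda()           # 1) move the model to the GPU first
model = torch.compile(model)   # 2) then compile it

# 3) finally wrap it for distributed training
#    (assumes torch.distributed.init_process_group() was called earlier)
model = torch.nn.parallel.DistributedDataParallel(model)
```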
Thanks for the answer! That's aligned with my experimental results. By the way, may I ask what the reason for this ordering is? Just trying to understand the PyTorch 2 fundamentals a bit more. Thanks in advance!
It's best to call .cuda() first because then Inductor doesn't need to reason about device copies; it just makes the compiler's job simpler.
It's best to compile before DDP because the communication collectives aren't traced; they will be in the future, though: [RFC] PT2-Friendly Traceable, Functional Collective Communication APIs · Issue #93173 · pytorch/pytorch · GitHub
For people coming to this thread, please keep in mind that the instructions I gave for DDP are out of date.
Please follow the instructions here instead: [Inductor] Run compiled model failed on 2023_08_17 nightly · Issue #107362 · pytorch/pytorch · GitHub