image_processor.postprocess slow after torch.compile

Hey all! I'm new to torch.compile and am using it to speed up inference.

When I apply torch.compile to the Flux pipeline transformer, the VAE image post-processing function gets really slow (from under 1 s to about 10 s). It seems to be spending most of the time moving tensors to the CPU ({method 'cpu' of 'torch._C.TensorBase' objects} in the profiler output).
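
For reference, this is roughly how I'm compiling the transformer and profiling the postprocess call (simplified sketch; `pipe` and the decoded `image` tensor come from the custom split-pipeline setup described further down):

```python
import cProfile
import pstats

import torch

# `pipe` stands in for my custom Flux pipeline breakout (see the list below);
# only the transformer is compiled, the VAE is left alone.
pipe.transformer = torch.compile(pipe.transformer)  # also tried mode="reduce-overhead"

# ... denoising loop + VAE decode run here, leaving the decoded `image` tensor on the GPU ...

with cProfile.Profile() as profiler:
    pil_images = pipe.image_processor.postprocess(image, output_type="pil")

pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
# with the compiled transformer, the top entry is
# {method 'cpu' of 'torch._C.TensorBase' objects}
```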

I'd like to understand why this happens and whether I'm doing something wrong. Happy to answer any questions.

For some additional background:

  • I am working with a custom breakout of the Flux pipeline inference call in diffusers that splits the pipeline across two GPUs, with the transformer on GPU 0 (RTX 4090) and the rest of the components on GPU 1 (RTX 4060 Ti); there's a simplified sketch of this setup after the list.
  • VAE is not compiled. The VAE decode process takes very little time on GPU 1.
  • Without torch.compile on the transformer, the postprocess function takes 0.75 seconds.
  • Using mode="reduce-overhead" results in a postprocess time of 3.5 seconds.
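
In case it's useful, here is a simplified sketch of the setup (the model id is just a placeholder, the denoising loop is omitted, and the real code is a custom breakout of the pipeline call rather than the stock `FluxPipeline.__call__`):

```python
import torch
from diffusers import FluxPipeline

# Model id is a placeholder; the real code is a custom breakout of the
# pipeline's __call__, not the stock FluxPipeline.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

pipe.transformer.to("cuda:0")    # RTX 4090: transformer only
pipe.vae.to("cuda:1")            # RTX 4060 Ti: VAE + remaining components
pipe.text_encoder.to("cuda:1")
pipe.text_encoder_2.to("cuda:1")

pipe.transformer = torch.compile(pipe.transformer)  # VAE stays uncompiled


def decode_and_postprocess(latents: torch.Tensor):
    """Move denoised latents from the cuda:0 loop to cuda:1, decode, and postprocess."""
    latents = latents.to("cuda:1")
    # (real code also unpacks the Flux latents before decoding)
    latents = latents / pipe.vae.config.scaling_factor + pipe.vae.config.shift_factor
    image = pipe.vae.decode(latents, return_dict=False)[0]   # this part is fast
    return pipe.image_processor.postprocess(image, output_type="pil")  # this is the slow step
```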