Significant Slowdown in Training Speed After Using compile() in PyTorch 2.1.0

Hello everyone,

I’m seeing a substantial drop in training speed after adding compile() to my PyTorch model. Without compile(), each training iteration takes roughly 1 second. After integrating compile(), each iteration takes about 70 seconds, and this persists beyond the first iteration, so it does not appear to be just the one-time compilation overhead.

The slowdown occurs mainly during the model’s forward pass and during the backward pass of the loss. I haven’t integrated any third-party libraries.
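
For context, the setup is essentially a plain training loop with the model wrapped in compile(). The snippet below is only a minimal sketch of that structure (the model, shapes, and optimizer here are placeholders; my real model is just larger):

```python
import torch
import torch.nn as nn

# Placeholder model and data; my real model is larger, but the structure is the same.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# The only change compared to the fast baseline is this wrapping call.
model = torch.compile(model)

for step in range(100):
    x = torch.randn(64, 1024, device="cuda")
    y = torch.randint(0, 10, (64,), device="cuda")

    optimizer.zero_grad(set_to_none=True)
    out = model(x)          # forward pass: one of the slow spots
    loss = loss_fn(out, y)
    loss.backward()         # backward pass: the other slow spot
    optimizer.step()
```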

I’m looking for advice on how to profile and pinpoint the source of this slowdown. I’d greatly appreciate your input.
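
Is something like the torch.profiler setup below a reasonable way to narrow this down, or is there a better way to see what compile() is spending its time on? This is only a sketch; train_one_step() stands in for the loop body shown above:

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

def train_one_step():
    # Placeholder for the loop body above (forward, loss, backward, optimizer step).
    ...

# Skip the first steps so the one-time compilation isn't all the trace contains,
# then record a few "steady-state" iterations.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=2, active=3),
    record_shapes=True,
    with_stack=True,
) as prof:
    for step in range(6):
        train_one_step()
        prof.step()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```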

Environment Details:

  • GPU: NVIDIA RTX 3090
  • PyTorch Version: 2.1.0
  • CUDA Version: 11.8
  • cuDNN Version: 8
  • Docker Image: pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel

Thank you in advance for any assistance or insights you can provide!