compiled_autograd Causes Abnormal Model Training Behavior

I’m running into an issue when enabling compiled_autograd in PyTorch: with it turned on, my model’s training behavior differs significantly.

Problem Description

Under the same environment, model, and dataset, toggling the following configuration:

torch._dynamo.config.compiled_autograd = True

produces noticeably different loss and gradient-norm trends compared to leaving it at the default False.
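
For context, the training loop is shaped roughly like the sketch below. Everything except the compiled_autograd flag (the toy model, random data, AdamW, and the shapes) is a placeholder rather than my actual setup, but it shows how the flag is set and how the loss and gradient norm are logged:

```python
import torch
import torch.nn as nn

# The toggle under comparison; set to False for the baseline run.
torch._dynamo.config.compiled_autograd = True

torch.manual_seed(0)

# Toy model and data as stand-ins for the real setup.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1)).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
compiled_model = torch.compile(model)

x = torch.randn(1024, 64, device="cuda")
y = torch.randn(1024, 1, device="cuda")

for step in range(100):
    opt.zero_grad(set_to_none=True)
    loss = nn.functional.mse_loss(compiled_model(x), y)
    loss.backward()
    # Total gradient norm (the quantity plotted below); max_norm=inf means no actual clipping.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=float("inf"))
    opt.step()
    if step % 10 == 0:
        print(f"step {step:3d}  loss {loss.item():.6f}  grad_norm {grad_norm.item():.6f}")
```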

Observations

  • When compiled_autograd is off (False), the loss and gradient-norm trends behave as expected.
  • Turning it on (True), however, produces abnormal trends in both metrics.

Visual Comparison

I’ve attached the plots below:

  • Red: compiled_autograd = False
  • Gray: compiled_autograd = True


Environment Details

  • PyTorch version: 2.5.1
  • CUDA version: 12.1
  • GPU: NVIDIA GeForce RTX 3090
  • Operating System: Ubuntu 22.04

Any insights or suggestions would be greatly appreciated!

This indeed sounds like a bug. Do you see this issue in all of your tests and models, or only for a specific use case? In the latter case, could you share a minimal, executable code snippet that reproduces the issue?
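
For example, something along these lines would already help; the toy model, shapes, and data here are just placeholders:

```python
import torch
import torch.nn as nn

def run(compiled_autograd: bool):
    # Reset dynamo state between runs and toggle the flag under test.
    torch.manual_seed(0)
    torch._dynamo.reset()
    torch._dynamo.config.compiled_autograd = compiled_autograd

    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1)).cuda()
    fn = torch.compile(model)

    x = torch.randn(16, 32, device="cuda")
    y = torch.randn(16, 1, device="cuda")

    loss = nn.functional.mse_loss(fn(x), y)
    loss.backward()
    return loss.detach(), [p.grad.clone() for p in model.parameters()]

loss_off, grads_off = run(compiled_autograd=False)
loss_on, grads_on = run(compiled_autograd=True)

# Compare the loss and per-parameter gradients between the two settings.
print("loss diff:", (loss_off - loss_on).abs().item())
for i, (g0, g1) in enumerate(zip(grads_off, grads_on)):
    print(f"param {i}: max grad diff = {(g0 - g1).abs().max().item():.3e}")
```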
