Hi,

With PyTorch 2.2.0 (the issue was the same with version 2.1.2) on CPU, `torch.compile` sometimes makes my code 2x slower. Here are my timings on a toy regression task:

Without `torch.compile`:

Epoch 0: Loss: 5.15e-01 in 1.3s.

Epoch 1: Loss: 1.91e+01 in 1.3s.

Epoch 2: Loss: 6.26e+03 in 1.3s.

Epoch 3: Loss: 3.59e+04 in 1.4s.

Epoch 4: Loss: 7.14e+04 in 1.3s.

Epoch 5: Loss: 9.38e+05 in 1.3s.

Epoch 6: Loss: 1.49e+02 in 1.3s.

Epoch 7: Loss: 1.01e+00 in 1.3s.

Epoch 8: Loss: 8.76e+00 in 1.3s.

Epoch 9: Loss: 1.25e+00 in 1.3s.

Epoch 10: Loss: 1.39e+00 in 1.3s.

Epoch 11: Loss: 3.83e-01 in 1.3s.

Epoch 12: Loss: 5.26e-01 in 1.3s.

Epoch 13: Loss: 5.29e-01 in 1.3s.

Epoch 14: Loss: 5.71e-01 in 1.3s.

Epoch 15: Loss: 5.71e-01 in 1.3s.

Epoch 16: Loss: 5.55e-01 in 1.3s.

Epoch 17: Loss: 5.40e-01 in 1.4s.

Epoch 18: Loss: 5.25e-01 in 1.4s.

Epoch 19: Loss: 6.16e-01 in 1.4s.

Epoch 20: Loss: 1.37e+00 in 1.4s.

With `torch.compile`:

Epoch 0: Loss: 5.15e-01 in 5.6s.

Epoch 1: Loss: 1.91e+01 in 1.3s.

Epoch 2: Loss: 6.26e+03 in 3.5s.

Epoch 3: Loss: 3.59e+04 in 2.4s.

Epoch 4: Loss: 7.14e+04 in 1.8s.

Epoch 5: Loss: 9.37e+05 in 3.6s.

Epoch 6: Loss: 1.49e+02 in 1.3s.

Epoch 7: Loss: 1.00e+00 in 1.3s.

Epoch 8: Loss: 8.85e+00 in 1.3s.

Epoch 9: Loss: 1.27e+00 in 1.3s.

Epoch 10: Loss: 1.39e+00 in 1.3s.

Epoch 11: Loss: 3.79e-01 in 1.4s.

Epoch 12: Loss: 5.18e-01 in 1.5s.

Epoch 13: Loss: 5.27e-01 in 1.5s.

Epoch 14: Loss: 5.69e-01 in 2.3s.

Epoch 15: Loss: 5.68e-01 in 4.5s.

Epoch 16: Loss: 5.53e-01 in 6.1s.

Epoch 17: Loss: 5.42e-01 in 5.6s.

Epoch 18: Loss: 5.36e-01 in 5.4s.

Epoch 19: Loss: 7.87e-01 in 4.8s.

Epoch 20: Loss: 1.98e+00 in 4.6s.

Those results can be reproduced with the following code:

```
import torch
import torch.nn as nn
import numpy as np
from time import time

torch.set_num_threads(2)

input_size = 1
output_size = 1
hidden_size = 1024
num_data = 10000

# seeds
seed = 1234
torch.manual_seed(seed)
np.random.seed(seed)

# hyper-parameters
num_epochs = 50
learning_rate = 0.01

# toy dataset
x_train = np.random.rand(num_data, input_size)
y_train = np.cos(2 * np.pi * x_train) + 0.1 * np.random.randn(num_data, input_size)

# regression model
model = nn.Sequential(
    nn.Linear(input_size, hidden_size),
    nn.GELU(),
    nn.Linear(hidden_size, hidden_size),
    nn.GELU(),
    nn.Linear(hidden_size, hidden_size),
    nn.GELU(),
    nn.Linear(hidden_size, hidden_size),
    nn.GELU(),
    nn.Linear(hidden_size, output_size),
)

# loss and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.NAdam(model.parameters(), lr=learning_rate)

# set to False to get the timings without torch.compile
use_compile = True
if use_compile:
    model = torch.compile(model, mode="reduce-overhead", fullgraph=True)

# train the model (full batch, one optimizer step per epoch)
x_train = torch.from_numpy(x_train.astype(np.float32))
y_train = torch.from_numpy(y_train.astype(np.float32))
for epoch in range(num_epochs):
    start_time = time()

    # forward pass
    outputs = model(x_train)
    loss = criterion(outputs, y_train)

    # backward and optimize
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    print(f'Epoch {epoch}: Loss: {loss.item():.2e} in {time()-start_time:.1f}s.')
```
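
To put a number on the per-epoch slowdown, the two logs above can be compared line by line. This is only a rough sketch: `parse_times` and `slowdown` are helper names I made up for illustration, and only a few of the logged epochs are pasted in.

```python
import re

# Matches log lines such as "Epoch 3: Loss: 3.59e+04 in 2.4s."
LINE_RE = re.compile(r"Epoch (\d+): Loss: \S+ in ([\d.]+)s\.")


def parse_times(log):
    """Map epoch index -> wall-clock seconds for each matching log line."""
    return {int(m.group(1)): float(m.group(2)) for m in LINE_RE.finditer(log)}


def slowdown(eager, compiled):
    """Per-epoch ratio compiled / eager, for epochs present in both logs."""
    return {e: compiled[e] / eager[e] for e in sorted(eager.keys() & compiled.keys())}


# A few lines copied from the two logs above (not the full output).
eager_log = """
Epoch 0: Loss: 5.15e-01 in 1.3s.
Epoch 1: Loss: 1.91e+01 in 1.3s.
Epoch 16: Loss: 5.55e-01 in 1.3s.
"""
compiled_log = """
Epoch 0: Loss: 5.15e-01 in 5.6s.
Epoch 1: Loss: 1.91e+01 in 1.3s.
Epoch 16: Loss: 5.53e-01 in 6.1s.
"""

ratios = slowdown(parse_times(eager_log), parse_times(compiled_log))
for epoch, ratio in ratios.items():
    print(f"Epoch {epoch}: {ratio:.1f}x")
```

On these three epochs the ratio is about 4.3x for epoch 0, 1.0x for epoch 1 and 4.7x for epoch 16, so the slowdown is not limited to the first call.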

Is my usage of `torch.compile` correct? Are there any options I could try to avoid this issue?

Thanks for the help.