Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run

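I'm trying to compile my model using torch.compile with mode='reduce-overhead' on a machine with CUDA. I already clone the outputs and call torch.compiler.cudagraph_mark_step_begin() before each model invocation, as the error message suggests, but I still get a runtime error saying "Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run".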
Here is a minimal example reproducing the issue:

import torch
import torch.nn as nn

device = 'cuda'
model = nn.Linear(10, 1).to(device)
model = torch.compile(model, fullgraph=True, mode='reduce-overhead')

for _ in range(5):
    torch.compiler.cudagraph_mark_step_begin()

    input = torch.randn((10,), dtype=torch.float32, requires_grad=True, device=device)
    outputs = model(input).clone()
    loss = outputs.sum().clone()
    loss.backward()

Running it raises the following error:

Traceback (most recent call last):
  File "runme.py", line 15, in <module>
    loss.backward()
  File "...lib/python3.12/site-packages/torch/_tensor.py", line 626, in backward
    torch.autograd.backward(
  File "...lib/python3.12/site-packages/torch/autograd/__init__.py", line 347, in backward
    _engine_run_backward(
  File "...lib/python3.12/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run. Stack trace: File "...lib/python3.12/site-packages/torch/_dynamo/external_utils.py", line 45, in inner
    return fn(*args, **kwargs). To prevent overwriting, clone the tensor outside of torch.compile() or call torch.compiler.cudagraph_mark_step_begin() before each model invocation.

I'm using torch 2.6.0+cu124, CUDA 12.4, and Ubuntu 22.04.5 on an NVIDIA A10G.
Any help is appreciated.