I'm trying to compile my model using torch.compile with mode='reduce-overhead' on a machine with CUDA, and no matter what I do I get a runtime error saying "Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run".
Here is a minimal example reproducing the issue:
import torch
import torch.nn as nn
device = 'cuda'
model = nn.Linear(10, 1).to(device)
model = torch.compile(model, fullgraph=True, mode='reduce-overhead')
for _ in range(5):
    torch.compiler.cudagraph_mark_step_begin()
    input = torch.randn((10,), dtype=torch.float32, requires_grad=True, device=device)
    outputs = model(input).clone()
    loss = outputs.sum().clone()
    loss.backward()
I get an error:
Traceback (most recent call last):
  File "runme.py", line 15, in <module>
    loss.backward()
  File "...lib/python3.12/site-packages/torch/_tensor.py", line 626, in backward
    torch.autograd.backward(
  File "...lib/python3.12/site-packages/torch/autograd/__init__.py", line 347, in backward
    _engine_run_backward(
  File "...lib/python3.12/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
    return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run. Stack trace: File "...lib/python3.12/site-packages/torch/_dynamo/external_utils.py", line 45, in inner
    return fn(*args, **kwargs). To prevent overwriting, clone the tensor outside of torch.compile() or call torch.compiler.cudagraph_mark_step_begin() before each model invocation.
I'm using torch '2.6.0+cu124', CUDA version 12.4, and Ubuntu 22.04.5, with an NVIDIA A10G.
Any help is appreciated.