I learned that when capturing a CUDA Graph, if a new tensor is created, the graph only records its memory address. During the replay phase, the graph will operate on that same address. Is there a risk of accessing an invalid address (if it wasn’t properly allocated) or overwriting memory belonging to other variables? Or does PyTorch’s internal memory pool (caching allocator) guarantee that these addresses remain valid and reserved?
import torch
static_tensor = torch.arange(5, device='cuda:0')
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_tensor.copy_(torch.arange(1, 1 + 5, device='cuda:0'))  # a new tensor is created here
g.replay()
Graph allocations have fixed addresses over the life of a graph including repeated instantiations and launches. This allows the memory to be directly referenced by other operations within the graph without the need of a graph update, even when CUDA changes the backing physical memory. Within a graph, allocations whose graph ordered lifetimes do not overlap may use the same underlying physical memory.
So I think it's safe to create a new GPU tensor during capture, because its lifetime will be managed by CUDA and PyTorch's memory pool.
PyTorch uses a custom pool for CUDA graph allocations specifically for this reason. So, just to confirm: yes, it is safe to create new tensors during capture.
Yes, it is safe to create new CUDA tensors during graph capture in PyTorch, as long as:
You are using the default CUDA caching allocator (the normal PyTorch behavior).
The allocation happens inside the capture region.
In this case, PyTorch and the caching allocator ensure that the virtual addresses remain valid across graph replays.
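To make this concrete, here is a minimal sketch of a capture that allocates a tensor inside the capture region and then reads it after several replays. The function name `capture_with_internal_alloc` and the variable names are only illustrative, and the side-stream warm-up follows the pattern recommended in the PyTorch CUDA graphs notes. It assumes the default caching allocator and skips itself when no GPU is present.

```python
import torch


def capture_with_internal_alloc():
    """Capture a graph whose output tensor is allocated during capture."""
    static_in = torch.zeros(5, device="cuda")

    # Warm up on a side stream before capturing.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        _ = static_in + 1
    torch.cuda.current_stream().wait_stream(side)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        # New tensor created during capture; its memory comes from the
        # graph's private pool and stays reserved while we hold references.
        static_out = static_in + 1

    values = []
    for v in (1.0, 2.0, 3.0):
        static_in.fill_(v)   # update the input in place (same address)
        g.replay()           # rewrites static_out at its recorded address
        values.append(static_out[0].item())
    return values


if torch.cuda.is_available():
    values = capture_with_internal_alloc()
    print(values)
```

Each replay writes into the same `static_out` buffer, so copy the result out (for example with `.clone()`) if you need it to survive the next replay.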
For your example specifically:
static_tensor = torch.arange(5, device='cuda:0')
This tensor is allocated outside the graph capture, so its lifetime is NOT managed by the graph. You are responsible for keeping it alive and ensuring it is not freed or resized while the graph is still in use.
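One way to handle that responsibility is to keep the graph and its static tensors together in one object and only ever update the input in place. The sketch below assumes this approach; `GraphedDouble` is a hypothetical wrapper written for illustration, not a PyTorch API.

```python
import torch


class GraphedDouble:
    """Hypothetical wrapper that owns a graph and its static tensors."""

    def __init__(self, size):
        # Allocated outside capture: holding it on self keeps it alive
        # for as long as the graph can be replayed.
        self.static_in = torch.zeros(size, device="cuda")

        # Warm up on a side stream before capturing.
        side = torch.cuda.Stream()
        side.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side):
            _ = self.static_in * 2
        torch.cuda.current_stream().wait_stream(side)

        self.graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(self.graph):
            self.static_out = self.static_in * 2

    def __call__(self, batch):
        # In-place copy keeps the address the graph recorded. Rebinding
        # (self.static_in = batch.cuda()) would leave the graph reading
        # the old buffer instead of the new data.
        self.static_in.copy_(batch)
        self.graph.replay()
        return self.static_out.clone()


if torch.cuda.is_available():
    f = GraphedDouble(4)
    doubled = f(torch.tensor([1.0, 2.0, 3.0, 4.0])).tolist()
    print(doubled)
```

The input batch must match the shape and dtype of `static_in`, since the captured kernels were recorded for that exact buffer.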
Regarding this statement:
memory allocation during cuda capturing will be recorded as a memory node
This refers to CUDA graph-level memory nodes, which are created when capturing cudaMallocAsync.
However, relying on cudaMallocAsync for graph capture is generally not recommended unless you fully understand how those memory nodes behave. The default caching allocator already provides correct and safe behavior for CUDA graph capture in typical use cases.
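If you are unsure which allocator your process is using, you can check at runtime. This is a small sketch assuming a PyTorch version that provides `torch.cuda.get_allocator_backend()` (2.0 or later, as far as I know); the helper name `allocator_backend` is just for illustration. The backend is normally selected through the `PYTORCH_CUDA_ALLOC_CONF` environment variable (for example `backend:cudaMallocAsync`), and the default is `native`.

```python
import torch


def allocator_backend():
    """Return the active CUDA allocator backend name, or None without a GPU."""
    if not torch.cuda.is_available():
        return None
    # "native" is the default caching allocator; "cudaMallocAsync" means
    # PyTorch is delegating to CUDA's built-in asynchronous allocator.
    return torch.cuda.get_allocator_backend()


print(allocator_backend())
```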