Why is it hard to deal with loops and conditions in PyTorch graph capture, while C++ can handle loops and conditions well?
CUDA Graphs capture the GPU workload and can significantly reduce kernel-launch overhead by replaying the captured work.
This also requires reading from and writing to the same (virtual) memory addresses on every replay, and avoiding any CPU-side work inside the captured region.
Conditions in Python code are executed on the host and are thus not captured: only the branch that was taken during capture is recorded into the graph, and replay always re-executes that branch regardless of the condition's current value.
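A minimal sketch of this effect, assuming a CUDA device is available (the function `step` and its `flag` argument are hypothetical illustrations, not a PyTorch API): the Python `if` is evaluated once on the host during capture, so the captured graph contains only the taken branch.

```python
import torch

def step(x, flag):
    # Host-side condition: evaluated once at capture time,
    # not re-evaluated when the graph is replayed.
    if flag:
        return x * 2
    return x + 1

if torch.cuda.is_available():
    x = torch.ones(4, device="cuda")  # static input buffer

    # Warm up on a side stream before capture, as recommended
    # by the torch.cuda.graph docs.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        step(x, flag=True)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        y = step(x, flag=True)  # the `flag=True` branch is baked in

    # Mutate the static input and replay: the graph still runs the
    # multiply-by-2 branch, no matter what `flag` would be now.
    x.copy_(torch.full((4,), 3.0, device="cuda"))
    g.replay()
    torch.cuda.synchronize()
    # y now holds x * 2, i.e. [6., 6., 6., 6.]
```

Because `flag` never reaches the GPU, changing it after capture has no effect; to make the branch dynamic you would have to recapture, maintain one graph per branch, or express the condition as tensor operations.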
These docs explain the limitations in more detail.