As I see from the examples in the CUDA semantics page of the PyTorch master documentation, the real input has exactly the same shape as the static input used in the initial recording of the graph.
Can we use a torch tensor of a different shape (e.g., a different batch size) for the real input?
Moreover, can we use an object instead of a torch.Tensor for the input? The object has several fields that are torch.Tensor instances.
Does the restriction on using data of a different shape have anything to do with graph memory management? I can't really understand the Graph Memory Management section.
No, as described in the docs:
Replaying a graph sacrifices the dynamic flexibility of typical eager execution in exchange for greatly reduced CPU overhead. A graph’s arguments and kernels are fixed, so a graph replay skips all layers of argument setup and kernel dispatch, including Python, C++, and CUDA driver overheads. Under the hood, a replay submits the entire graph’s work to the GPU with a single call to cudaGraphLaunch. Kernels in a replay also execute slightly faster on the GPU, but eliding CPU overhead is the main benefit.
You should try CUDA graphs if all or part of your network is graph-safe (usually this means static shapes and static control flow, but see the other constraints) and you suspect its runtime is at least somewhat CPU-limited.
You would have to use static shapes, since the arguments and kernels are fixed at capture time and are simply replayed.
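To make the static-shape requirement concrete, here is a minimal sketch of the capture-and-replay pattern from the docs. The helper name `make_replay_fn`, the `Linear(8, 4)` model, and the batch size of 16 are my own assumptions for illustration, and the eager fallback is only there so the snippet also runs on a machine without a GPU.

```python
import torch

def make_replay_fn(model, static_input, warmup_iters=3):
    """Capture model(static_input) once; return a function that replays it.

    Hypothetical helper for illustration, not part of the PyTorch API.
    """
    use_graph = static_input.is_cuda  # CUDA graphs need a GPU; otherwise run eagerly
    if use_graph:
        # Warm up on a side stream before capturing, as the docs recommend.
        side = torch.cuda.Stream()
        side.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side), torch.no_grad():
            for _ in range(warmup_iters):
                model(static_input)
        torch.cuda.current_stream().wait_stream(side)

        # Record the kernels once; their arguments (addresses, shapes) are now fixed.
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph), torch.no_grad():
            static_output = model(static_input)

    def run(real_input):
        # The graph only knows the captured shape, so reject anything else.
        if real_input.shape != static_input.shape:
            raise ValueError(
                f"expected shape {tuple(static_input.shape)}, "
                f"got {tuple(real_input.shape)}"
            )
        if not use_graph:
            with torch.no_grad():
                return model(real_input)
        static_input.copy_(real_input)  # refill the captured buffer in place
        graph.replay()                  # rerun the recorded kernels
        return static_output.clone()    # clone so the next replay cannot overwrite it

    return run

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 4).to(device)
run = make_replay_fn(model, torch.zeros(16, 8, device=device))
out = run(torch.randn(16, 8, device=device))  # same shape as the static input
```

The real data never reaches the graph as a new argument: it is copied into the tensor that was captured, which is why its shape has to match.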
I see, so maybe CUDA graphs are not the right approach for improving performance in a learning framework with variable input sizes, e.g. an RL environment. Thank you so much!
However, maybe the parts of the module that have a static input size can be recorded. I'll check on that.
Yes, that might certainly be the case. The Partial-network capture section of the docs shows an example.
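As a rough sketch of that idea, assuming a hypothetical model split into an `encoder` that sees a variable sequence length and a `head` whose input is always `(32, 64)`, only the static part is passed to `torch.cuda.make_graphed_callables`. The module names and sizes are made up for illustration, and the snippet falls back to plain eager execution when no GPU is available.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical split: `encoder` handles variable-length input and stays eager,
# `head` always receives a (32, 64) tensor and is therefore graph-safe.
encoder = torch.nn.Linear(10, 64).to(device)
head = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
).to(device)

if device == "cuda":
    # The sample argument must match the real input in shape and requires_grad state.
    sample = torch.randn(32, 64, device=device, requires_grad=True)
    head = torch.cuda.make_graphed_callables(head, (sample,))

def forward(x):
    # x has shape (32, seq_len, 10); seq_len may change from call to call.
    h = encoder(x).mean(dim=1)  # eager, dynamic shape in, fixed (32, 64) out
    return head(h)              # replayed from the graph when running on CUDA

outputs = [forward(torch.randn(32, seq_len, 10, device=device)) for seq_len in (5, 9)]
```

Whether this pays off depends on how much of the runtime sits in the static part; if most of the CPU overhead comes from the dynamic section, graphing only the head will not help much.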