CUDA Kernel Launches

Say I call output = model(input) to run a forward pass. Does PyTorch launch a single CUDA kernel for this, or can a single forward pass call launch multiple kernels? The same question applies to the backward pass, loss calculation, and optimizer step. Thank you.

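Multiple CUDA kernels will be launched, since each operation in the model typically launches at least one kernel of its own. The exceptions are a forward pass that contains a single operation (unlikely) or one that you’ve captured in its entirety with CUDA graphs, as described here. The same applies to the backward pass, loss calculation, and optimizer step.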
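If you want to check this on your own model, one option is to run a forward pass under torch.profiler and count the recorded events. The snippet below is only a rough sketch: toy_model, toy_input, and profile_forward are made-up names for illustration, it assumes a reasonably recent PyTorch, and it falls back to the CPU (where it can only list ATen operator calls) if no GPU is available.

```python
import torch
import torch.nn as nn
from torch.autograd import DeviceType
from torch.profiler import ProfilerActivity, profile


def profile_forward(model, inp):
    """Run one forward pass under the profiler.

    Returns (ATen operator names, names of device-side CUDA events).
    The second list is empty when the input lives on the CPU.
    """
    use_cuda = inp.is_cuda
    activities = [ProfilerActivity.CPU]
    if use_cuda:
        activities.append(ProfilerActivity.CUDA)

    with torch.no_grad(), profile(activities=activities) as prof:
        model(inp)
        if use_cuda:
            # Kernel launches are asynchronous; wait so they are all recorded.
            torch.cuda.synchronize()

    events = prof.events()
    ops = [e.name for e in events
           if e.device_type == DeviceType.CPU and e.name.startswith("aten::")]
    # Device-side events are mostly kernels, but may also include memcpy/memset.
    device_events = [e.name for e in events if e.device_type == DeviceType.CUDA]
    return ops, device_events


# Hypothetical toy model, chosen only to keep the example small.
device = "cuda" if torch.cuda.is_available() else "cpu"
toy_model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
toy_input = torch.randn(32, 64, device=device)

ops, device_events = profile_forward(toy_model, toy_input)
print(f"{len(ops)} ATen operator calls, {len(device_events)} device-side CUDA events")
for name in device_events:
    print("  ", name)
```

Even for this three-layer model you should see several operator calls, and on a GPU several separate device-side events, for the one model(input) call. The exact kernel names and counts depend on the PyTorch version, the GPU, and the backend libraries, so treat the numbers as indicative rather than fixed.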