I would guess you might not see parallel execution since your CPU might not be able to schedule the kernels fast enough and/or your code might contain (unwanted) synchronizations.
In the default "eager" mode the CPU needs to dispatch each operation to its internal operator implementation and then launch the corresponding CUDA kernel.
If your model has e.g. 100 layers (and for simplicity let's assume each layer calls into a single CUDA kernel), the CPU would need to launch 100 kernels for model_0 before it could start scheduling the workload of model_1.
Assuming the CPU is fast enough and you are not blocking it with synchronizations, you should see an overlap and could use a profiler (e.g. the native PyTorch profiler or Nsight Systems) to verify it.
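As a minimal sketch of this setup, you could launch each model's forward pass on its own CUDA stream and synchronize only once at the end (the `run_concurrently` helper and the toy models here are just for illustration, not from your code):

```python
import torch

def run_concurrently(model_0, model_1, x0, x1):
    # launch each forward pass on a separate CUDA stream so the
    # kernels of both models are eligible to overlap on the GPU
    s0, s1 = torch.cuda.Stream(), torch.cuda.Stream()
    with torch.cuda.stream(s0):
        out0 = model_0(x0)
    with torch.cuda.stream(s1):
        out1 = model_1(x1)
    # a single sync at the end; per-model syncs would serialize the work
    torch.cuda.synchronize()
    return out0, out1

if torch.cuda.is_available():
    model_0 = torch.nn.Linear(1024, 1024).cuda()
    model_1 = torch.nn.Linear(1024, 1024).cuda()
    x0 = torch.randn(64, 1024, device="cuda")
    x1 = torch.randn(64, 1024, device="cuda")
    out0, out1 = run_concurrently(model_0, model_1, x0, x1)
```

Keep in mind that using separate streams only makes the kernels *eligible* to overlap; whether they actually do depends on the GPU's free resources and on how fast the CPU can launch them.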
However, if your overall workload is already CPU-limited, you would of course not see any overlap.
This should already be visible while running a single model in a profiler, and Nsight Systems would show "whitespaces" between the actual kernel executions.
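To check this with the native profiler, something like the following sketch works (the toy `Linear` model is just a placeholder for your actual workload); if the CPU-side operator times dominate while the GPU kernels are short, your workload is launch-bound:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# placeholder model and input; substitute your own
model = torch.nn.Linear(1024, 1024)
x = torch.randn(64, 1024)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    for _ in range(5):
        model(x)

# compare CPU dispatch time vs. GPU kernel time per op
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```

You can also export the trace via `prof.export_chrome_trace("trace.json")` and inspect the timeline for gaps between kernels, similar to the "whitespaces" Nsight Systems would show.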
If you are working with static input shapes (and meet the other requirements) you could try to use CUDA Graphs as described here to reduce the kernel launch overhead.
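For reference, a minimal capture-and-replay sketch with `torch.cuda.CUDAGraph` could look like this (the `graphed` helper is hypothetical; it assumes a model with static input shapes already on the GPU):

```python
import torch

def graphed(model, static_input):
    # warm up on a side stream before capture, as the docs recommend
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # capture the whole forward pass into a single graph
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_output = model(static_input)

    def replay(new_input):
        static_input.copy_(new_input)  # fill the captured input buffer
        g.replay()                     # relaunch all kernels with one call
        return static_output
    return replay

if torch.cuda.is_available():
    model = torch.nn.Linear(1024, 1024).cuda()
    static_input = torch.randn(64, 1024, device="cuda")
    forward = graphed(model, static_input)
    out = forward(torch.randn(64, 1024, device="cuda"))
```

After capture, replaying the graph launches all 100 kernels of the earlier example with a single CPU-side call, which is exactly what removes the per-kernel launch overhead.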