Running independent `nn.Module` instances in `nn.ModuleList` truly in parallel in PyTorch

CUDA operations are asynchronous with respect to the CPU by default. If you launch multiple kernels, the CPU can run ahead (assuming it's fast enough and the GPU workload is large enough). This does not mean the GPU work is already done; your profiling just shows when the host reaches the line of code that stops the timer. Once you synchronize or access the result of the computation, PyTorch will implicitly synchronize for you.
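For example, here is a minimal timing sketch using CUDA events; the `models` `nn.ModuleList` and input `x` are just placeholders standing in for your actual setup:

```python
import torch
import torch.nn as nn

device = torch.device("cuda")
# placeholder modules and input standing in for your actual code
models = nn.ModuleList([nn.Linear(1024, 1024) for _ in range(4)]).to(device)
x = torch.randn(64, 1024, device=device)

torch.cuda.synchronize()  # make sure no prior GPU work skews the measurement
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
outputs = [m(x) for m in models]  # kernels are only enqueued here
end.record()

torch.cuda.synchronize()  # wait for the GPU to finish before reading the timer
print(f"elapsed: {start.elapsed_time(end):.3f} ms")
```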

Your first approach is showing the overhead of multiple synchronizations inside the loop, which should also be visible in a profiler. Use the native PyTorch profiler or Nsight Systems to see the actual execution, including the blocking operations and the gaps between kernels.
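A minimal sketch of the native profiler, again with placeholder modules and input; the exported trace can be opened in `chrome://tracing` or TensorBoard to inspect the kernels, synchronization points, and gaps:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

device = torch.device("cuda")
models = nn.ModuleList([nn.Linear(1024, 1024) for _ in range(4)]).to(device)
x = torch.randn(64, 1024, device=device)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    outputs = [m(x) for m in models]
    torch.cuda.synchronize()

# summary table plus a chrome trace for the timeline view
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")
```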

Using custom CUDA streams allows you to execute kernels in parallel on the GPU. However, this needs enough free compute resources, since kernels are generally written to saturate all compute resources on their own.
Take a look at this post and GTC presentation for more details.
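A minimal sketch of the stream-based approach, again assuming placeholder modules and input; whether the kernels actually overlap still depends on the available compute resources:

```python
import torch
import torch.nn as nn

device = torch.device("cuda")
models = nn.ModuleList([nn.Linear(1024, 1024) for _ in range(4)]).to(device)
x = torch.randn(64, 1024, device=device)

# one stream per module so the launches are not serialized on the default stream
streams = [torch.cuda.Stream() for _ in models]
outputs = []

torch.cuda.synchronize()  # make sure the input is ready before the side streams use it
for m, s in zip(models, streams):
    with torch.cuda.stream(s):
        outputs.append(m(x))

torch.cuda.synchronize()  # wait for all streams before consuming the outputs
```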