Running Two Batches in Parallel Using CUDA Streams Does Not Overlap During Training

This post shows how to overlap data transfer and computation. Overlapping compute kernels additionally requires that the GPU has enough free resources: for example, if the kernel running on the first stream occupies all SMs, kernels launched on other streams have to wait until resources free up.
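Below is a minimal sketch of the transfer/compute overlap pattern, written here as a PyTorch training loop (an assumption; the model, optimizer, and batch data are placeholders, not taken from the original post). The idea is to copy the next batch to the GPU on a dedicated side stream, using pinned host memory and `non_blocking=True`, while the default stream computes on the current batch.

```python
import torch

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()  # side stream dedicated to host-to-device copies

# Placeholder model, optimizer, and loss (assumptions for illustration only).
model = torch.nn.Linear(1024, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

# Placeholder data; pinned host memory is required for truly asynchronous copies.
batches = [(torch.randn(64, 1024).pin_memory(),
            torch.randint(0, 10, (64,)).pin_memory()) for _ in range(8)]

def to_device(batch):
    x, y = batch
    # non_blocking=True lets the copy overlap with compute on the default stream
    return x.to(device, non_blocking=True), y.to(device, non_blocking=True)

# Prefetch the first batch on the copy stream.
with torch.cuda.stream(copy_stream):
    next_batch = to_device(batches[0])

for i in range(len(batches)):
    # Make the default stream wait until the prefetched copy has finished.
    torch.cuda.current_stream().wait_stream(copy_stream)
    x, y = next_batch
    # Tell the caching allocator these tensors are now used by the current
    # stream, so their memory is not reused too early.
    x.record_stream(torch.cuda.current_stream())
    y.record_stream(torch.cuda.current_stream())

    # Start copying the following batch while this one is being computed.
    if i + 1 < len(batches):
        with torch.cuda.stream(copy_stream):
            next_batch = to_device(batches[i + 1])

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

torch.cuda.synchronize()
```

Note that this only overlaps copies with compute. Whether two compute kernels from different batches actually run concurrently still depends on the occupancy point above: if one kernel saturates the SMs, the other is serialized behind it regardless of how the streams are set up.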