How to overlap h2d and training?

Actually, I hadn't used the profiler; I was judging by the time it takes to queue the kernels.
After fiddling around a bit more, this time with the Nsight profiler, I see this interesting behaviour:
With N=1000, queuing up the kernels takes negligible time (~0.005s on my setup).
With N=2000, however, the duration jumps to ~2.3 seconds, so something is going on.
Here are the corresponding Nsight timelines, where the top one is for N=1000 and the bottom one is for N=2000:

Note that the initialization takes ~3 seconds in both cases, but after that N=2000 takes significantly more time to queue the kernels, far more than the 2x you would expect from doubling N.
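
For anyone who wants to reproduce the queuing-time numbers without a profiler, here is a minimal sketch of how the launch loop can be timed separately from the final synchronization. The helper `time_enqueue` and its `launch`/`synchronize` parameters are names I made up for illustration; the stand-in callables are there only so the snippet runs without a GPU, and the PyTorch calls mentioned in the comment are an assumption about how you might plug it in, not the code that produced the timelines above.

```python
import time
from typing import Callable, Tuple


def time_enqueue(launch: Callable[[], None],
                 synchronize: Callable[[], None],
                 n: int) -> Tuple[float, float]:
    """Return (enqueue_seconds, total_seconds) for n asynchronous launches."""
    synchronize()  # drain earlier work so it is not counted
    start = time.perf_counter()
    for _ in range(n):
        launch()  # expected to return as soon as the work is queued
    enqueued = time.perf_counter()
    synchronize()  # wait for all queued work to finish
    done = time.perf_counter()
    return enqueued - start, done - start


if __name__ == "__main__":
    # Stand-ins so the sketch runs anywhere. With PyTorch one might pass
    # launch=lambda: torch.mm(a, b) for hypothetical CUDA tensors a and b,
    # and synchronize=torch.cuda.synchronize.
    for n in (1000, 2000):
        enqueue_s, total_s = time_enqueue(lambda: None, lambda: None, n)
        print(f"N={n}: enqueue {enqueue_s:.6f}s, total {total_s:.6f}s")
```

If the enqueue time stays tiny while the total grows, the launches are returning immediately. If the enqueue time gets close to the total at larger N, the launch calls themselves are blocking, which is what the N=2000 measurement suggests.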

I thought that was interesting, although queuing thousands of kernels may not be of much practical importance.