I would stick to the first approach, since it pushes a whole batch to the device at once (avoiding multiple small transfers), and that copy can potentially be executed asynchronously while your GPU is busy. Using pinned memory will also speed up the transfer. Have a look at NVIDIA’s blog post for some more information.
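To make this a bit more concrete, here is a minimal sketch. It assumes the first approach means moving each batch to the device inside the training loop. The toy dataset, the batch size, and the `Linear` model are placeholders I made up for illustration, not part of your code. The relevant pieces are `pin_memory=True` in the `DataLoader` and `non_blocking=True` in the `.to()` call.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Fall back to the CPU so the sketch also runs on a machine without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_cuda = device.type == "cuda"

# Hypothetical toy data: 256 samples with 16 features and a binary label.
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))

# pin_memory=True makes the loader return batches in page-locked host memory,
# which is what allows the host-to-device copy to run asynchronously.
loader = DataLoader(dataset, batch_size=64, pin_memory=use_cuda)

# Placeholder model, only here so the loop has something to do.
model = torch.nn.Linear(16, 2).to(device)

num_batches = 0
for inputs, targets in loader:
    # One transfer per batch instead of one per sample.
    # non_blocking=True only has an effect when the source tensor is pinned.
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)

    outputs = model(inputs)
    num_batches += 1
```

The samples stay on the CPU inside the `Dataset`, and only the collated batch is copied over, so you get one larger transfer per iteration rather than many small ones.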