What is the best way to move data to the GPU?

I’ve read through a couple of GPU implementations for Kaggle Titanic and, as a PyTorch noob (I basically left it untouched since joining this forum a few years ago), noticed that data is moved to the GPU device in batches during training.
Are there any other more efficient ways?
I know that in CUDA C/C++ programming there are many ways to move data to the GPU synchronously or asynchronously, but with PyTorch I am lost. Is the above the only way, or the best way? It just seems like a lot of data movement.
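
For reference, this is roughly the pattern I keep seeing in those notebooks (a minimal sketch I put together myself, not code from any particular kernel; the model and data are placeholders):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder dataset and model, just to show the transfer pattern.
dataset = TensorDataset(torch.randn(1000, 8), torch.randint(0, 2, (1000,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True)
model = nn.Linear(8, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for features, labels in loader:
    # The host-to-device copy happens here, once per batch.
    features = features.to(device)
    labels = labels.to(device)

    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
```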

You could move the data to the device asynchronously, either manually by using pinned memory or via pin_memory=True in the DataLoader (together with non_blocking=True in the to() call), but you would of course need some overlapping work to see any benefits.
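
A minimal sketch of both variants, assuming a CUDA device is available (the dataset here is a placeholder):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")  # assumes a CUDA-capable machine
dataset = TensorDataset(torch.randn(1000, 8), torch.randint(0, 2, (1000,)))

# pin_memory=True makes the DataLoader return batches in page-locked
# (pinned) host memory, which is required for truly asynchronous copies.
loader = DataLoader(dataset, batch_size=64, pin_memory=True)

for features, labels in loader:
    # non_blocking=True lets the host-to-device copy run asynchronously
    # instead of blocking the CPU until the transfer finishes.
    features = features.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # The copy only overlaps with something if there is independent work
    # to do here; the first kernel that consumes `features` will
    # implicitly wait for the copy to complete.

# Manual alternative: pin a tensor yourself before the async copy.
x = torch.randn(64, 8).pin_memory()
x = x.to(device, non_blocking=True)
```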

Oh wow, I just learned about that in my CUDA course: “pinned host memory”.

If you have a moment, can you point me to the parts of the PyTorch repo where I can see all the CUDA code?

I want to see whether cudaMemPrefetchAsync, streams, and calculating the number of blocks based on the number of SMs are being used.
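
In the meantime, I noticed streams are also exposed at the Python level. This is not PyTorch's internal c10/cuda code path, just a hand-rolled sketch of the stream-based prefetching idea, assuming a CUDA device; all the names (copy_stream, batches, the matmul standing in for a forward pass) are made up for illustration:

```python
import torch

device = torch.device("cuda")  # assumes a CUDA-capable machine
copy_stream = torch.cuda.Stream()  # side stream dedicated to H2D copies

# Pinned host batches, so the async copies can actually overlap.
batches = [torch.randn(64, 8).pin_memory() for _ in range(10)]
weight = torch.randn(8, 8, device=device)

# Issue the first copy on the side stream.
with torch.cuda.stream(copy_stream):
    current = batches[0].to(device, non_blocking=True)

for i in range(len(batches)):
    # Make the compute (default) stream wait for the pending copy.
    torch.cuda.current_stream().wait_stream(copy_stream)
    # Tell the caching allocator this tensor is now used on the
    # compute stream, so its memory isn't reused too early.
    current.record_stream(torch.cuda.current_stream())

    # Start copying the next batch while we compute on this one.
    if i + 1 < len(batches):
        with torch.cuda.stream(copy_stream):
            nxt = batches[i + 1].to(device, non_blocking=True)

    out = current @ weight  # stand-in for the real forward/backward pass

    if i + 1 < len(batches):
        current = nxt

torch.cuda.synchronize()
```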

Found it: pytorch/c10/cuda at 674e52b0b913d7b7f733ce1e73a42cb383860d55 · pytorch/pytorch · GitHub