I’ve read through a couple of GPU implementations for Kaggle Titanic, and as a PyTorch noob (I’d basically put it off since joining this forum a few years ago), I noticed that data is moved to the GPU device in batches during training.
Are there any other more efficient ways?
I know that in CUDA C/C++ programming there are many ways to move data to the GPU synchronously or asynchronously, but with PyTorch I am lost. Is the above the only way, or the best way? It just seems like a lot of data movement.
You could move the data to the device asynchronously by manually using pinned memory, or via pin_memory=True in the DataLoader, but you would of course need some overlapping work to see any benefit.
Oh wow, I just learned about “pinned memory on the host” in my CUDA course.
If you have a moment, can you point out the parts of the PyTorch repo where I can see all the CUDA code?
I want to see whether prefetchAsync, streams, and calculating the number of blocks based on the number of SMs are being used.