How to make CPU->GPU data transfer faster?


  • The project is a vision problem that processes images.

  • The data loader is based on the following implementation, so as I understand it the data copying already overlaps with the network forward/backward pass:
    Lighter/ at master · HenryJia/Lighter

  • If the image data is fp32, the throughput is 2000 images per second; if it is fp16, it is 2500 images per second. From this I assume that copying data from CPU to GPU takes a significant share of the time. Meanwhile, the data loading time is very small, so the CPU-side data loading should not be the problem. The batch size per GPU is 32 and the image size is 640x640.


  • One question: since non_blocking=True is used, is there a way to check whether the data is ready on the GPU for forward/backward, so that I can confirm that the CPU->GPU copy is indeed the bottleneck?

  • Another question: is there a faster way to move data from CPU to GPU?
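
As a rough sanity check on the throughput figures above, the host-to-device bandwidth they imply can be worked out directly. This is only a back-of-the-envelope sketch: it assumes 3-channel images (the post does not state the channel count) and treats the images-per-second figures as one aggregate stream. The helper name `required_bandwidth_gb_s` is made up for illustration.

```python
# Sizes and throughputs are taken from the post; CHANNELS = 3 is an assumption.
CHANNELS = 3
HEIGHT = WIDTH = 640


def required_bandwidth_gb_s(images_per_sec, bytes_per_element):
    """Bandwidth (GB/s, 1 GB = 1e9 bytes) needed to copy uncompressed images."""
    bytes_per_image = CHANNELS * HEIGHT * WIDTH * bytes_per_element
    return images_per_sec * bytes_per_image / 1e9


fp32_gb_s = required_bandwidth_gb_s(2000, 4)  # fp32: 4 bytes per element
fp16_gb_s = required_bandwidth_gb_s(2500, 2)  # fp16: 2 bytes per element

print(f"fp32: {fp32_gb_s:.2f} GB/s")  # fp32: 9.83 GB/s
print(f"fp16: {fp16_gb_s:.2f} GB/s")  # fp16: 6.14 GB/s
```

If the throughput is a total over several GPUs, the per-GPU requirement is this value divided by the number of GPUs. Comparing the result with the measured host-to-device bandwidth of the machine gives a first hint as to whether the copy can plausibly be the limit.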

Which image format are you using?
You could consider saving raw arrays, or using GPU decoding if you work with JPG images.
Another suggestion is to try DALI, as it's the fastest dataloader you can use.
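
On the earlier question of checking whether a non_blocking copy has finished and how long it takes: one possible approach is CUDA events. The following is a minimal sketch, assuming PyTorch with a CUDA device and a batch shaped like the one described in the question (32 x 3 x 640 x 640, with 3 channels being an assumption). `time_h2d_copy` is a hypothetical helper, not part of the linked data loader, and it returns None when no GPU is present.

```python
import torch


def time_h2d_copy(batch, device, repeats=10):
    """Mean host-to-device copy time in ms for one batch (None without CUDA)."""
    if device.type != "cuda":
        return None
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    batch.to(device, non_blocking=True)  # warm-up copy
    torch.cuda.synchronize()
    start.record()
    for _ in range(repeats):
        batch.to(device, non_blocking=True)
    end.record()
    end.synchronize()  # wait until all timed copies have completed
    return start.elapsed_time(end) / repeats


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = torch.empty(32, 3, 640, 640, dtype=torch.float32)

if device.type == "cuda":
    # Pinned (page-locked) host memory is needed for truly asynchronous copies.
    batch = batch.pin_memory()

copy_ms = time_h2d_copy(batch, device)
print("mean copy time (ms):", copy_ms)

if device.type == "cuda":
    gpu_batch = batch.to(device, non_blocking=True)
    copy_done = torch.cuda.Event()
    copy_done.record()  # marks the point right after the copy on this stream
    # query() returns True once all work recorded before the event is done.
    print("copy finished yet?", copy_done.query())
    copy_done.synchronize()
    print("copy finished now?", copy_done.query())
```

Comparing the measured copy time per batch with the time of one forward/backward step gives an indication of whether the transfer can be hidden behind compute or is the limiting factor.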

It is JPG format. Thanks for pointing out DALI; I will look into it.