Setup:
- The project is a vision problem that processes images.
- The data loader is based on the following implementation, so, as I understand it, the data copying is already overlapped with the network's forward/backward pass:
Lighter/loaders.py at master · HenryJia/Lighter (github.com)
- If the image data is fp32, the throughput is 2000 images per second; if it is fp16, it is 2500 images per second. I therefore assume that copying the data from CPU to GPU takes a significant share of the time. Meanwhile, the data loading time is very small, so the CPU-side data loading should not be the problem. The batch size per GPU is 32, and the image size is 640x640.
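As a rough sanity check on the numbers above, the implied CPU-to-GPU traffic can be estimated from the image size and throughput. This is only a back-of-the-envelope sketch: it assumes 3-channel images and that the throughput figures are per GPU, neither of which is stated above, and the function name is made up for illustration.

```python
def transfer_rate_gb_per_s(images_per_s, bytes_per_element,
                           height=640, width=640, channels=3):
    """Host-to-device traffic implied by a throughput, in GB/s (1 GB = 1e9 bytes).

    channels=3 is an assumption; the setup only gives the 640x640 size.
    """
    bytes_per_image = height * width * channels * bytes_per_element
    return images_per_s * bytes_per_image / 1e9


fp32_rate = transfer_rate_gb_per_s(2000, 4)  # fp32: 4 bytes per element
fp16_rate = transfer_rate_gb_per_s(2500, 2)  # fp16: 2 bytes per element
print(f"fp32: {fp32_rate:.2f} GB/s, fp16: {fp16_rate:.2f} GB/s")
# prints "fp32: 9.83 GB/s, fp16: 6.14 GB/s"
```

Under these assumptions, halving the bytes per image raises throughput by only 25%, not 2x, so the copy is probably not the only limiting factor. Comparing these rates with the measured host-to-device bandwidth of the machine would show how close the fp32 case is to the transfer limit.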
Problem:
- Since non_blocking=True is used, is there a way to check whether the data has already arrived on the GPU and is ready for the forward/backward pass, so that I can confirm that the CPU->GPU copy is indeed the bottleneck?
- Is there a faster way to move data from the CPU to the GPU?
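One possible way to probe both questions is with CUDA events. The sketch below is not taken from the linked loader: the helper name `copy_with_events`, the side stream, and the dummy batch are assumptions for illustration. It records an event right after a `non_blocking=True` copy, polls it with `query()`, and times the copy with `elapsed_time()`. The source tensor is pinned, since a non-blocking copy from pageable memory generally cannot run asynchronously.

```python
import torch


def copy_with_events(cpu_batch, device, stream):
    """Issue an async host-to-device copy on `stream`, bracketed by timing events."""
    start = torch.cuda.Event(enable_timing=True)
    done = torch.cuda.Event(enable_timing=True)
    with torch.cuda.stream(stream):
        start.record()  # recorded on `stream`, the current stream in this block
        gpu_batch = cpu_batch.to(device, non_blocking=True)
        done.record()   # completes once the copy above has finished
    return gpu_batch, start, done


if torch.cuda.is_available():
    device = torch.device("cuda")
    copy_stream = torch.cuda.Stream()
    # Dummy batch matching the setup (32 x 3 x 640 x 640, fp16), in pinned memory.
    batch = torch.empty(32, 3, 640, 640, dtype=torch.float16).pin_memory()

    gpu_batch, start, done = copy_with_events(batch, device, copy_stream)

    # Non-blocking check: True only if the copy has already completed.
    print("copy already finished:", done.query())

    # Make the compute stream wait for the copy before using the batch, and
    # tell the caching allocator that the batch is used on that stream.
    torch.cuda.current_stream().wait_event(done)
    gpu_batch.record_stream(torch.cuda.current_stream())

    done.synchronize()  # block the host so the timing below is valid
    print(f"copy took {start.elapsed_time(done):.2f} ms")
else:
    print("CUDA not available; nothing to measure")
```

If `done.query()` is usually still `False` when the forward pass is about to start, the compute stream is waiting on the copy, which would support the bottleneck hypothesis. As for speed, if the source images are 8-bit, copying them as `uint8` and converting to fp16/fp32 on the GPU would move half as many bytes as fp16 and a quarter as many as fp32; whether that fits this pipeline is an assumption to verify.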