Hi there,
I am running PyTorch on our 8-GPU cloud machine, so I use Docker to run it with only the required resources. Here is the command I used to launch the container: docker run -it --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 --cpus=4 --shm-size=1G --memory=10240m *image_name* /bin/bash
With the container launched as above, I find that training is dramatically slower when pin_memory is set to True in the DataLoader. I wrote a test benchmark based on the Horovod benchmark code and have posted it to my GitHub repo.
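For reference, a minimal sketch of the comparison I am making (hypothetical toy dataset and sizes, not the actual Horovod-based benchmark; the only setting that changes between runs is pin_memory):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the image dataset (shapes are illustrative only).
data = TensorDataset(
    torch.randn(64, 3, 8, 8),
    torch.zeros(64, dtype=torch.long),
)

for pin in (False, True):
    # Same loader configuration except for the pin_memory flag.
    loader = DataLoader(data, batch_size=16, num_workers=0, pin_memory=pin)
    batches = sum(1 for _ in loader)  # in the real benchmark, time this loop
```

In the real benchmark the loop body runs the model forward/backward and I measure images/second; everything else is held fixed.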
The benchmark shows that without data loading, the card can process 300 images/second. With data loading and pin_memory=False, it can process 292 images/second, which is reasonable. With pin_memory=True, however, it can process only 46 images/second.
I know pinned memory is meant to accelerate asynchronous tensor transfers between host and device memory, so this result is very strange. Could anyone explain why this happens and how to fix it? Thanks.
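To be clear about what I expect pinning to do, here is a minimal sketch of the usual pattern (assuming a CUDA build of PyTorch; a page-locked host tensor allows the host-to-device copy to run asynchronously):

```python
import torch

# Pageable host tensor (size is illustrative only).
x = torch.randn(1024, 1024)

if torch.cuda.is_available():
    x = x.pin_memory()  # move into page-locked host memory
    # non_blocking=True lets the H2D copy overlap with compute,
    # which is why pin_memory=True is normally a speedup, not a slowdown.
    y = x.to("cuda", non_blocking=True)
```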