Worse performance when using pin_memory in DataLoader

Hi there,

I am running PyTorch on our 8-GPU cloud machine, so I use Docker to run the job with only the required resources. Here is the docker command I used to launch it: docker run -it --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 --cpus=4 --shm-size=1G --memory=10240m *image_name* /bin/bash

With the command above, I find the training speed is dramatically slower when pin_memory is set to True in the DataLoader. I wrote a test benchmark based on the Horovod benchmark code and have posted it to my GitHub repo.

The benchmark shows that without data loading, the card can process 300 images/second. With data loading and pin_memory=False, the card can process 292 images/second, which is reasonable. However, with pin_memory=True, the card can only process 46 images/second.

I know pinned memory is used to accelerate asynchronous tensor transfers between host memory and device memory, but the performance here is very strange. Could anyone explain why this happens and how to fix it? Thanks.
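
For reference, a minimal sketch of this kind of throughput test looks roughly like the following (illustrative only, not the exact code in my repo; the synthetic dataset, batch size, and ResNet-50 model are placeholders):

# Minimal throughput sketch: measures images/second with and without pinned memory.
# Dataset size, batch size, worker count, and model choice are assumptions.
import time
import torch
import torchvision.models as models
from torch.utils.data import DataLoader, TensorDataset

def benchmark(pin_memory, num_batches=50, batch_size=32, num_workers=4):
    # Synthetic ImageNet-sized data so the test is self-contained.
    data = torch.randn(num_batches * batch_size, 3, 224, 224)
    labels = torch.randint(0, 1000, (num_batches * batch_size,))
    loader = DataLoader(TensorDataset(data, labels), batch_size=batch_size,
                        num_workers=num_workers, pin_memory=pin_memory)

    model = models.resnet50().cuda()
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    torch.cuda.synchronize()
    start = time.time()
    for images, target in loader:
        # non_blocking=True only overlaps the copy with compute when the source is pinned.
        images = images.cuda(non_blocking=True)
        target = target.cuda(non_blocking=True)
        optimizer.zero_grad()
        loss = criterion(model(images), target)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()
    elapsed = time.time() - start
    return num_batches * batch_size / elapsed

if __name__ == "__main__":
    for pin in (False, True):
        print(f"pin_memory={pin}: {benchmark(pin):.1f} images/sec")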

Are you seeing the swap being used a lot?
As far as I know, this might cause the slowdown.

@ptrblck Thanks for the reply. I have found the cause. I'm not an expert on the CPU side, but it seems that swapping does slow down the tensor transfer.

Our cloud machine has two NUMA nodes. The docker parameter --cpus=4 doesn't dedicate 4 CPU cores to the job; instead it spreads the load across all cores and throttles the container to a rate of 4/num_cores per core. To bind the training job to specific cores, the --cpuset-cpus parameter should be used instead, as in the example below.
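
For example, assuming cores 0-3 sit on the same NUMA node as GPU 0 (this can be checked with lscpu or numactl --hardware), the launch command becomes:

docker run -it --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 --cpuset-cpus=0-3 --shm-size=1G --memory=10240m *image_name* /bin/bash

The core IDs here are just an example; pick the ones reported for GPU 0's NUMA node on your machine.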