torch.utils.data.distributed.DistributedSampler very slow over Lustre

All,

I am attempting to troubleshoot poor data-loading performance with torch.utils.data.distributed.DistributedSampler when deployed on a machine that uses Lustre for file serving.

The specific scenario is a ResNet-50 network trained on ImageNet. My cluster uses Lustre for file serving, and the node I am using has a pair of Volta V100 GPUs. I expect a throughput of 600-1300 images/second per GPU, depending on settings and configuration details.
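For reference, the data pipeline is set up roughly like this. The paths, batch size, and worker count below are illustrative placeholders rather than my exact values, and the sketch assumes the process group is launched with torchrun or an equivalent that sets the usual environment variables:

```python
import torch.distributed as dist
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set by the launcher.
dist.init_process_group(backend="nccl")

transform = T.Compose([
    T.RandomResizedCrop(224),
    T.ToTensor(),
])

# Placeholder path for the Lustre-mounted ImageNet training set.
dataset = ImageFolder("/lustre/imagenet/train", transform=transform)

# DistributedSampler shards the index list across ranks.
sampler = DistributedSampler(dataset)
loader = DataLoader(
    dataset,
    batch_size=256,
    sampler=sampler,
    num_workers=8,
    pin_memory=True,
)
```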

My first attempt was reading directly from the mounted Lustre FS. This had the obvious problem of being limited by the ability of the file server to handle large numbers of small-file requests. Throughput was on the order of 100 images/second per GPU.

In order to isolate the performance bottleneck, I stripped out portions of the training loop until the only work left per iteration was pulling batches through the DistributedSampler-backed DataLoader (see the sketch below). Performance did not improve.
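Stripped down, the per-epoch loop looked essentially like this (reusing the loader and sampler from the sketch above; the exact timing code is just for illustration):

```python
import time

# No model, no optimizer: just iterate the sampler-backed DataLoader
# and measure per-rank image throughput.
sampler.set_epoch(0)  # keep the same shuffling behavior as real training

start = time.time()
n_images = 0
for images, labels in loader:
    n_images += images.size(0)
elapsed = time.time() - start

print(f"{n_images / elapsed:.1f} images/second on this rank")
```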

I tried mounting a squashfs image containing the dataset, in order to limit file-server metadata requests, but performance did not improve substantially. I also tried an ext4 image, which likewise made no significant difference.

An analogous benchmark in TensorFlow achieves about 1100 images/second per GPU on the same node, so I believe the issue is specific to either PyTorch or the way I am using it.

Can anyone suggest next steps for troubleshooting this?