So I am using 4 V100s on a single machine to train different network architectures (resnet152, and pnasnet5large from Cadene's pretrainedmodels repo) and I am experiencing similar behavior. The scripts run in the latest NVIDIA PyTorch container from https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html.
I can't keep the GPUs at 100% volatile GPU utilization; it keeps fluctuating between 0 and 100% on my server.
I built my script upon the ImageNet training example and, at least for pnasnet5large, the data-loading timings it reports are only 0.000 and 0.001. That makes sense, since I can only train pnasnet5large with a batch size of 72, so each iteration loads relatively little data.
I use a standard DataLoader to load the data from an SSD RAID 0 and have tried everything between 4 and 24 workers.
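For reference, here is a minimal sketch of my data-loading setup (the path and worker count are placeholders, and the 331 crop and 0.5 normalization are just what I understand pnasnet5large expects according to the pretrainedmodels repo; the real script takes everything from argparse, following the ImageNet example):

import torch
from torchvision import datasets, transforms

# placeholder values -- the real script reads these from argparse
train_dir = "/data/imagenet/train"   # lives on the SSD RAID 0
batch_size = 72                      # the most that fits for pnasnet5large
num_workers = 16                     # I have tried values between 4 and 24

train_dataset = datasets.ImageFolder(
    train_dir,
    transforms.Compose([
        transforms.RandomResizedCrop(331),   # pnasnet5large takes 331x331 input
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ]))

train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
    pin_memory=True,   # pinned host memory for faster copies to the GPUs
)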
As a sanity check, I tried the setup with the plain main.py from the ImageNet example, and the GPUs show the same utilization pattern.
Am I missing something here? Do I need to use the multiprocessing-distributed settings to make full use of these GPUs?
I tried to get it to work, but I get the following error:
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
AttributeError: module 'torch.multiprocessing' has no attribute 'spawn'
Searching for that error, I found a lot of posts stating that it is caused by an outdated PyTorch version.
However, the torch version in the NVIDIA container reports as 1.0.0a0, so that can't really be the reason.
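In case it helps narrow things down, here is a minimal sketch of how I would check whether spawn is available and otherwise fall back to starting the workers by hand (main_worker and args are placeholders for the ImageNet example's worker function and the parsed arguments):

import torch
import torch.multiprocessing as mp


def main_worker(gpu, ngpus_per_node, args):
    # placeholder for the ImageNet example's worker: it would set up the
    # process group and train on the GPU with index `gpu`
    print("worker", gpu, "of", ngpus_per_node, "started, args:", args)


if __name__ == "__main__":
    print("torch version:", torch.__version__)
    ngpus_per_node = torch.cuda.device_count()
    args = None  # placeholder for the parsed argparse namespace

    if hasattr(mp, "spawn"):
        # mp.spawn passes the process index as the first positional argument
        mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
    else:
        # fallback for builds where torch.multiprocessing has no spawn():
        # launch the workers manually with the 'spawn' start method
        ctx = mp.get_context("spawn")
        workers = [ctx.Process(target=main_worker, args=(i, ngpus_per_node, args))
                   for i in range(ngpus_per_node)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()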
I hope you can help me.
Best regards,
Andreas