Slow training, low memory usage on Tesla V100 16GB

For my research, I’m running a GAN on the deep learning online platform FloydHub.

My DCGAN-based model uses 64x64 images from a data set of >450k images, and my batch size is 128. The inputs to my generator and discriminator are constant in size (i.e. batches of 128 64x64 images). My local setup isn’t too powerful, and thankfully I have some research funds, so I am trying to maximize the number of experiments I can run and am thus using the powerful but expensive Tesla V100 16GB GPU.

I do use cuda, and the GPU is found:

self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# prints "GPU/CPU: GPU/CPU: Tesla V100-SXM2-16GB" 

I call .to(device), the newer replacement for .cuda(), on my generator, discriminator, loss, noise and label tensors, as well as on the fake and real images. When I checked, cudnn.benchmark was True by default. The num_workers of the DataLoader is 0.

When I print the weights of a generator layer or a discriminator layer, device='cuda:0' is included, so I know that these are on the GPU.

However, my GPU usage hovers at around or below 40%, and a pass over the data set (3817 batches) takes >10 minutes. I’d like to run for 100 epochs, but I’d prefer it didn’t take so long! I figure I must be doing something wrong/forgetting to cuda something (can’t think of what that would be though) or perhaps there’s some magic cuda or cudnn flag I can set [although there doesn’t seem to be any documentation for torch.backends.cudnn !]

Any ideas? Or is this just how long training should take? Many thanks in advance.

Maybe your training is I/O bound.
Could you try increasing the number of workers and see if your training speeds up?
You might also time your data loading using this example from the ImageNet training.
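For reference, here is a minimal sketch of that kind of timing, in the spirit of the data_time / batch_time meters in the ImageNet example but not copied from it. The helper name time_data_loading and its parameters are made up for illustration; it accepts any iterable of batches, so a DataLoader can be passed in directly.

```python
import time


def time_data_loading(loader, train_step, sync=None, max_batches=50):
    """Split wall-clock time into 'waiting for data' and 'running the step'.

    loader      : any iterable yielding batches (e.g. a torch DataLoader)
    train_step  : callable taking one batch (forward/backward/optimizer step)
    sync        : optional callable run after each step; on a GPU this would
                  typically be torch.cuda.synchronize, so that asynchronous
                  kernels are counted in the step time
    max_batches : stop after this many batches
    """
    data_time = 0.0
    step_time = 0.0
    n = 0
    end = time.perf_counter()
    for batch in loader:
        fetched = time.perf_counter()
        data_time += fetched - end      # time spent waiting for the batch
        train_step(batch)
        if sync is not None:
            sync()
        end = time.perf_counter()
        step_time += end - fetched      # time spent in the training step
        n += 1
        if n >= max_batches:
            break
    return {"batches": n, "data_s": data_time, "step_s": step_time}
```

If data_s is a large share of the total, the loop is mostly waiting for input, and raising num_workers (and possibly setting pin_memory=True) on the DataLoader is worth trying. If data_s is close to zero, the bottleneck is probably elsewhere. Note that the first batch also includes the time needed to start the loader's iterator.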

So I am using 4 V100s on one machine to train different network architectures (resnet152, pnasnet5large from Cadene's pretrainedmodels repo) and am experiencing similar behavior. The scripts run in the latest NVIDIA PyTorch container from

I can’t get the GPUs to 100% volatile GPU-utilization. It keeps hovering between 0 and 100% on my server.

I built my script upon the ImageNet training example and, at least for the pnasnet5large, I only get data timings of 0.000 and 0.001. This makes sense since I can only train with a batch size of 72 with the pnasnet5large.

I use a standard DataLoader to load the data from an SSD Raid 0 and tried everything between 4 and 24 workers.

As a sanity check, I tried the setup with the plain script from the ImageNet example, and the GPUs showed the same utilization.

Am I missing something here? Do I need to use the multiprocessing-distributed settings to make use of the full power of these GPUs?

I tried to get it to work, but I get the following error:
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
AttributeError: module 'torch.multiprocessing' has no attribute 'spawn'

In relation to that, I found a lot of posts stating that the PyTorch version is the problem in this case.
The torch version in the NVIDIA container seems to be 1.0.0a0 so that can’t really be the reason.
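One way to test that assumption directly, rather than going by the version string alone, is to ask the installed module whether it has the attribute. A pre-release build such as an a0 version does not necessarily contain everything the final release does. The helper below is only a sketch; check_attr is a made-up name, and it works on any importable module.

```python
import importlib


def check_attr(module_name, attr):
    """Return (version, has_attr) for an importable module.

    The version is read from the top-level package's __version__,
    or "unknown" if the package does not define one.
    """
    module = importlib.import_module(module_name)
    root = importlib.import_module(module_name.split(".")[0])
    version = getattr(root, "__version__", "unknown")
    return version, hasattr(module, attr)


# Inside the container one would run, for example:
#   print(check_attr("torch.multiprocessing", "spawn"))
```

If the second value comes back False, the installed build simply does not provide mp.spawn, whatever the version string suggests.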

I hope you can help me.
Best regards,

hey @Andreas_Lu, did you happen to solve this?

Unfortunately I didn’t find a solution for the problem. I used the Volta-DGX to conduct experiments for my thesis, so I stopped working with it when I turned it in.
I tried quite a bit but didn’t get an improvement in GPU-utilization…
If you happen to find a solution, let me know as I am eager to learn the cause of this as well…

oh that sucks! I am using the tesla-dgx and it seems training is a lot slower than my local machine :frowning:

@Andreas_Lu, @nabsabs
I assume you are both using the DGX-1 with V100 GPUs?

Have you checked whether the data loading/processing might be the bottleneck in your setup?
If so, I would suggest having a look at DALI, which should speed up the data processing pipeline.

PS: It should be already preinstalled in our containers.