Why is PyTorch's GPU utilization so low in production (NOT training)?

Oh, thank you for your detailed explanation. :wink:

I am sorry to bother you again.

If the GPU utilization (checked by nvidia-smi) is defined as @0xFFFFFFFF mentioned above, it means that data loading costs much more time than the forward pass, so that there are periods when no kernel is running on the GPU and data is only being loaded onto it, right? :thinking:

If so, there really is a ‘gap’ between data loading and processing. Is there any way to avoid it?

Thanks a million.

Yes, that’s usually the case if your actual workload on the GPU is small and thus your CPU code execution cannot be hidden. You could try to play around with the number of workers to possibly speed up the data loading. Also make sure the data is stored on an SSD. If you are using some image preprocessing, you might want to install PIL-SIMD, which is a drop-in replacement for PIL using SIMD instructions.
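Something like this as a rough starting point (FakeData and all the numbers here are just placeholders for your actual dataset and setup):

```python
from torch.utils.data import DataLoader
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# FakeData stands in for your real dataset; tune num_workers for your machine
dataset = datasets.FakeData(size=1000, transform=transforms.ToTensor())

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,     # more worker processes overlap data loading with GPU work
    pin_memory=True,   # page-locked host memory allows faster, asynchronous copies
)

for data, target in loader:
    data = data.to('cuda', non_blocking=True)      # can overlap with already queued kernels
    target = target.to('cuda', non_blocking=True)
    # forward / backward pass goes here
```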


OK, thank you very much, I will give it a try. Thanks.

My understanding of data loading makes me believe it wouldn’t be relevant for inference (production rather than training). All the weights, the model, and the input start from GPU RAM (because they are only a couple of GBs combined and can be pre-loaded onto the device before inference). If the DataLoader is supposed to aid in asynchronously copying memory from CPU to GPU while the GPU is doing some work, then it doesn’t help here.

If you could somehow copy the input data onto the GPU beforehand (and have the memory to do so), then the DataLoader won’t help, that’s true.

However, if we are talking about production systems, I assume you’ll get the data from some kind of streaming service (in which case the usage of a DataLoader wouldn’t make sense). In that case you would still have to push the data onto the GPU or am I misunderstanding your use case?
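To illustrate what I mean, a minimal sketch of such a loop without a DataLoader (resnet50 and the random input just stand in for your model and streamed data):

```python
import torch
import torchvision.models as models

model = models.resnet50().to('cuda').eval()

def get_next_request():
    # stand-in for a real streaming source: just a random CPU tensor here
    return torch.randn(1, 3, 224, 224)

with torch.no_grad():
    for _ in range(100):
        sample = get_next_request().pin_memory()        # page-locked memory speeds up the copy
        sample = sample.to('cuda', non_blocking=True)   # asynchronous host-to-device copy
        output = model(sample)                          # inference runs on the GPU
```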

The input to Tacotron2, for example, is a text string, so it only takes a couple hundred bytes. So yes, we do get input data from a streaming service, but all our engines have inputs smaller than a megabyte, which makes the cost of loading them upfront, before running the model, trivial. So data loading is not the cause of the low utilization.

Yeah right, you’ve mentioned Tacotron2. Have you had a chance to profile it?
I’ll try to get it working on my machine and have a look at it.

@ptrblck I liked the ‘dummy code’ above, so I thought I’d play with it a little bit, since I’ve also been trying to understand some low utilization in Pytorch. Maybe this is getting a little off topic…but maybe not.

I made simple tweaks to support training and testing modes, different numbers of workers, different batch sizes, and Windows. Still resnet50, (3, 224, 224). (Also: Titan XP, pytorch-nightly from Feb 28.)

The timing numbers ignore the first call to the model, since that’s much slower. Utilization (for training mode) is from nvidia-smi, captured by hand.

High ‘utilization’ can be reached in training mode, at high batch sizes (as one would expect).

TRAIN (rate in images/s, utilization from nvidia-smi):

| Batch Size | Imgs/s (#Wrk=1) | Imgs/s (#Wrk=2) | Imgs/s (#Wrk=4) | Util. (#Wrk=1) | Util. (#Wrk=2) | Util. (#Wrk=4) |
|---|---|---|---|---|---|---|
| 1 | 10.51 | 10.82 | 10.79 | 23% | 27% | 25% |
| 2 | 20.81 | 17.62 | 20.85 | 27% | 30% | 30% |
| 4 | 40.86 | 43.28 | 42.51 | 28% | 37% | 37% |
| 8 | 79.31 | 85.84 | 82.68 | 32% | 35% | 50% |
| 16 | 142.99 | 165.7 | 161.5 | 46% | 50% | 78% |
| 32 | 155.08 | 183.86 | 179.83 | 70% | 75% | 89% |

For test mode, I get:

TEST (rate in images/s, utilization from nvidia-smi):

| Batch Size | Imgs/s (#Wrk=1) | Imgs/s (#Wrk=2) | Imgs/s (#Wrk=4) | Imgs/s (#Wrk=8) | Util. (#Wrk=1) | Util. (#Wrk=2) | Util. (#Wrk=4) | Util. (#Wrk=8) |
|---|---|---|---|---|---|---|---|---|
| 1 | 33.16 | 33.37 | 33.48 | 31.83 | 20% | 19% | 19% | 19% |
| 2 | 65.21 | 65.08 | 55.71 | 62.35 | 29% | 22% | 24% | 22% |
| 4 | 125.79 | 123.84 | 118.78 | 116.11 | 28% | 31% | 24% | 28% |
| 8 | 147.84 | 227.16 | 226.23 | 220.73 | 30% | 35% | 39% | 41% |
| 16 | 162.01 | 271.54 | 393.73 | 396.41 | 30% | 39% | 58% | 60% |
| 32 | 170.08 | 293.18 | 475.14 | 529.37 | 32% | 57% | 78% | 77% |

~Max possible rate: 640 images/s

Max possible is estimated by dividing BS=32 by the best-case GPU time in nvprof (~50ms). I see a pattern in nvprof for 4 workers, where it’s ~(50,50,50,50)ms, and then there is an extra delay. I don’t quite see the same pattern for #workers=8, but I do see interspersed gaps.
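In case anyone wants to reproduce the nvprof timelines, this is roughly how such a run can be annotated; emit_nvtx adds NVTX ranges for each op so they show up in the timeline (resnet50 and the sizes are just placeholders):

```python
import torch
import torchvision.models as models

model = models.resnet50().to('cuda').eval()
data = torch.randn(32, 3, 224, 224, device='cuda')

# warm up so the much slower first iteration is not part of the profile
with torch.no_grad():
    model(data)
torch.cuda.synchronize()

# run e.g. under `nvprof --profile-from-start off python script.py`
with torch.cuda.profiler.profile():
    with torch.autograd.profiler.emit_nvtx():
        with torch.no_grad():
            for _ in range(10):
                model(data)
torch.cuda.synchronize()
```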

[BTW: similar-in-spirit benchmark code in recent MXNet gives 732 images/s, comparable to the 640 number, I believe. If there is more interest in fleshing this out (e.g. with nvprof timings), I can start a new thread here, or a new issue in PyTorch.]

Here is my modified code:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

import torchvision.models as models
import torchvision.datasets as datasets
import torchvision.transforms as transforms
import time

def main():
    mode = 'test'  # 'test' for inference only, 'train' for the full training loop
    model = models.resnet50()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    N = 1280
    dataset = datasets.FakeData(size=N, transform=transforms.ToTensor())
    if mode=='test': # switch to evaluate mode
        model.eval()
    model.to('cuda')
    for num_workers in [1, 2, 4, 8]: # 4 < 2 for test
        for batch_size in [1, 2, 4, 8, 16, 32]:
            loader = DataLoader(dataset, num_workers=num_workers, batch_size=batch_size, pin_memory=True)
            if mode=='test':
                for i, (data, target) in enumerate(loader):
                    if i==1: # start timing after the first, much slower, iteration
                        tm = time.time()
                    data = data.to('cuda', non_blocking=True)
                    output = model(data)
            else: # mode=='train':
                for i, (data, target) in enumerate(loader):
                    if i==1:
                        tm = time.time()
                    data = data.to('cuda', non_blocking=True)
                    target = target.to('cuda', non_blocking=True).long()
                    optimizer.zero_grad()
                    output = model(data)
                    loss = criterion(output, target)
                    loss.backward()
                    optimizer.step()
            torch.cuda.synchronize() # GPU ops run asynchronously; wait for them to finish before stopping the timer
            tm = time.time() - tm
            print('Mode=%s: NumWorkers=%2d  BatchSize=%2d  Time=%6.3fs  Imgs/s=%6.2f' % (mode, num_workers, batch_size, tm, N/tm))
            torch.cuda.empty_cache() # doesn't seem to be working...

if __name__ == '__main__':
    main()

Maybe we should create a separate thread about increasing utilization during training. This thread is mostly about inference (production), not training, so the DataLoader, as discussed above, is not relevant since it is not used at all. I changed the title to make this clearer.

For completeness, I edited the post above, adding the (rough) utilization numbers. They follow a similar trend as training, but are generally lower, as expected; the GPU is less busy, without back-prop.

Certainly, a DataLoader wouldn’t be used in production. The numbers from nvprof are more telling.

Resnet50 is not Tacotron, so that would have to be benchmarked (and examined in nvprof). But in general, you’ll either need to go data-parallel or model-parallel (if you have the memory) to get the highest utilization.
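As a rough illustration of the data-parallel direction, a minimal sketch using nn.DataParallel (the model, batch size, and device-count check are just placeholders; for multi-GPU training, DistributedDataParallel would be the usual recommendation):

```python
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet50().eval()
if torch.cuda.device_count() > 1:
    # DataParallel splits each input batch across the visible GPUs along dim 0
    model = nn.DataParallel(model)
model.to('cuda')

data = torch.randn(64, 3, 224, 224, device='cuda')  # a larger batch to split across devices
with torch.no_grad():
    output = model(data)
```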

If you’re interested in speeding up inference, I’d suggest looking at this: https://developer.nvidia.com/tensorrt. Even if you don’t think this looks applicable to your situation, some of the TensorRT documentation has good discussions of inference performance; see here: https://docs.nvidia.com/deeplearning/sdk/pdf/TensorRT-Best-Practices.pdf

Hello, I just tested the inference process. If I have 25 images to handle, the 1st one takes some time, and the next 24 images take very little time. But the interesting thing is that if I have 26 images, the 26th image takes about the same time as the 1st image. Could you help explain that?

1st image Time: 0.4107s

2nd–25th images Time: 0.0010s

26th image Time: 0.4206s
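For reference, this is roughly how I measure the times; as far as I understand, explicit synchronization is needed for per-image timings to be meaningful, so please correct me if my measurement itself is the issue (resnet50 here just stands in for my actual model):

```python
import torch
import torchvision.models as models
import time

model = models.resnet50().to('cuda').eval()
data = torch.randn(1, 3, 224, 224, device='cuda')

with torch.no_grad():
    model(data)                      # warm-up: CUDA context init, cuDNN algorithm selection
torch.cuda.synchronize()

with torch.no_grad():
    for i in range(30):
        torch.cuda.synchronize()     # make sure earlier work has finished before starting the timer
        t0 = time.time()
        model(data)
        torch.cuda.synchronize()     # wait for this forward pass to actually finish on the GPU
        print('image %2d: %.4fs' % (i + 1, time.time() - t0))
```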

Here is some profiling data I have collected using Nsight Systems during inference of Tacotron2. When we look at the results, it becomes clear why the utilization is so low when we perform inference with PyTorch.

The problems are:

  1. Extremely small kernels (taking around 5 microseconds each) are called one at a time, so the cost of the kernel launch (on the CPU) is generally higher than the cost of the kernel itself (on the GPU). This makes invoking a kernel more expensive than actually doing the computation (for example, the CPU time to launch “gemv2T_kernel_val” is about 15 microseconds, whereas the GPU time to actually complete the computation is about 5 microseconds). See the sketch after this list.

  2. There seem to be gaps between the CUDA API calls because PyTorch adds additional wrappers around the tensors.
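To make point 1 concrete, a small sketch using the autograd profiler; for tiny ops like this, the CPU-side time per op is often comparable to or larger than the CUDA time (the matrix-vector size is just an illustration, not taken from Tacotron2):

```python
import torch

x = torch.randn(256, 256, device='cuda')
v = torch.randn(256, device='cuda')

# warm-up so one-time setup costs are excluded
for _ in range(10):
    torch.mv(x, v)
torch.cuda.synchronize()

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for _ in range(100):
        torch.mv(x, v)   # a tiny matrix-vector kernel; the CPU launch cost can exceed its GPU time
torch.cuda.synchronize()

# compare the CPU time and CUDA time columns for this op
print(prof.key_averages().table(sort_by='cuda_time_total'))
```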

This is a portion of the Nsight Systems timeline. The sky blue bar indicates some sort of work happening on the GPU; the absence of the sky blue or red bar means that no computation is happening on the GPU. The “CUDA API” row shows the CPU’s preparation for launching CUDA kernels, i.e., the CPU-side work needed to launch a kernel on the GPU.

As we zoom in further, we can see that there are HUGE gaps in the work that is happening on the device.

and even more so if we zoom in further.


In this issue, a dev from NVIDIA explains why this problem is occurring. Essentially, the answer is: PyTorch is not optimized well, and the nature of Tacotron2’s network architecture produces this low nvidia-smi utilization. It is not a bug.


Utilization was around 40%. What do you think could cause this?

Ubuntu 18.04.3 LTS
PyTorch 1.2.0
Python 3.7.4
GeForce RTX 2080 Ti
Driver Version: 418.74
CUDA Version: 10.1

With a Tesla V100-SXM2, I’m getting utilization of less than 30% (mostly 5%–20%) with this dummy code.
What could I do to raise the utilization?

Ubuntu 16.04
Pytorch 1.2.0
Driver 418.67
CUDA 10.1

@alexmath I think this may also answer your question.

If you want to make sure that you are utilizing the GPU completely, try running CUDA MPS. Simply running multiple instances of the network without CUDA MPS will increase the nvidia-smi utilization to near 100%, but this doesn’t mean you are actually using all of the GPU.

Hi, I have the same problem, but my GPU usage is around 1%. I ran your code. Can you help me, please?

Did you try out different batch sizes as well as different numbers of workers?
As a quick test you could remove the data loading completely, create a single input batch on the GPU, and just train the model with it. This should yield a high GPU utilization, provided no synchronizations were added.
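Something along these lines (resnet50 and the shapes are placeholders for your actual model and input):

```python
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet50().to('cuda')
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# a single batch created directly on the GPU, so no data loading is involved at all
data = torch.randn(32, 3, 224, 224, device='cuda')
target = torch.randint(0, 1000, (32,), device='cuda')

for _ in range(100):          # watch nvidia-smi while this loop runs
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
```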

Hi, I am facing a similar situation where I am trying to see the best possible training speed I can get. I am working on a multi-class segmentation problem, and I am transferring the training loss (for each class) to the CPU in each iteration for display purposes. I am also transferring validation scores for each class to the CPU after each epoch. I want to know why this should affect the overall speed of my model’s training.
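For reference, a tiny sketch of the kind of transfer I mean; my understanding is that copying a CUDA tensor to the CPU makes the host wait for the queued GPU work, but please correct me if that is not the relevant issue here:

```python
import torch
import time

a = torch.randn(4096, 4096, device='cuda')
b = torch.randn(4096, 4096, device='cuda')

# kernel launches are asynchronous: this line returns almost immediately
t0 = time.time()
c = a @ b
print('after launch:        %.4fs' % (time.time() - t0))

# copying the result (or a per-class loss) to the CPU blocks until the GPU is done,
# so doing this in every iteration inserts a synchronization point into the training loop
t0 = time.time()
c_cpu = c.cpu()
print('after .cpu() (sync): %.4fs' % (time.time() - t0))
```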