ImageNet training: extremely low GPU usage

As pointed out in https://github.com/pytorch/examples/issues/164, the ImageNet training example gets almost zero GPU utilization for me. I have four Titan V GPUs, and the data is stored locally; it is not on an SSD, but the disk read throughput should not make things this slow. I posted an issue on GitHub, but there has been no response for several days, and the earlier issue was never resolved. I tried 8 workers and 20 workers, and GPU usage is low in both cases. For 8 workers:

Epoch: [20][100/5005]   Time 0.283 (1.396)      Data 0.001 (1.006)      Loss 2.5632 (2.3716)    Prec@1 48.438 (47.486)  Prec@5 67.969 (72.424)
Epoch: [20][110/5005]   Time 0.239 (1.390)      Data 0.001 (1.013)      Loss 2.4275 (2.3647)    Prec@1 49.609 (47.646)  Prec@5 71.484 (72.579)
Epoch: [20][120/5005]   Time 3.725 (1.422)      Data 3.495 (1.056)      Loss 2.0656 (2.3711)    Prec@1 53.906 (47.556)  Prec@5 75.781 (72.382)
Epoch: [20][130/5005]   Time 3.500 (1.427)      Data 3.267 (1.069)      Loss 2.4683 (2.3707)    Prec@1 45.312 (47.442)  Prec@5 68.750 (72.343)
Epoch: [20][140/5005]   Time 0.228 (1.396)      Data 0.001 (1.046)      Loss 2.2713 (2.3637)    Prec@1 50.781 (47.565)  Prec@5 72.266 (72.407)

For 20 workers:

Epoch: [20][530/5005]   Time 0.227 (0.781)      Data 0.001 (0.507)      Loss 2.4582 (2.3638)    Prec@1 42.969 (47.641)  Prec@5 68.359 (72.389)
Epoch: [20][540/5005]   Time 0.317 (0.772)      Data 0.001 (0.498)      Loss 2.2743 (2.3646)    Prec@1 48.438 (47.633)  Prec@5 75.000 (72.362)
Epoch: [20][550/5005]   Time 0.225 (0.786)      Data 0.001 (0.511)      Loss 2.1320 (2.3634)    Prec@1 50.000 (47.668)  Prec@5 76.172 (72.384)
Epoch: [20][560/5005]   Time 0.290 (0.776)      Data 0.003 (0.502)      Loss 2.4872 (2.3635)    Prec@1 44.141 (47.633)  Prec@5 67.969 (72.380)
Epoch: [20][570/5005]   Time 0.250 (0.782)      Data 0.002 (0.507)      Loss 2.3034 (2.3634)    Prec@1 47.266 (47.608)  Prec@5 74.609 (72.364)
Epoch: [20][580/5005]   Time 2.115 (0.776)      Data 1.873 (0.502)      Loss 2.3284 (2.3650)    Prec@1 45.312 (47.570)  Prec@5 72.656 (72.340)
Epoch: [20][590/5005]   Time 0.399 (0.782)      Data 0.002 (0.508)      Loss 2.4217 (2.3645)    Prec@1 45.703 (47.591)  Prec@5 70.703 (72.348)
Epoch: [20][600/5005]   Time 3.144 (0.778)      Data 2.857 (0.504)      Loss 2.3866 (2.3629)    Prec@1 48.828 (47.632)  Prec@5 71.875 (72.362)
Epoch: [20][610/5005]   Time 0.236 (0.784)      Data 0.002 (0.510)      Loss 2.3191 (2.3638)    Prec@1 51.953 (47.630)  Prec@5 73.047 (72.362)
Epoch: [20][620/5005]   Time 0.231 (0.776)      Data 0.001 (0.502)      Loss 2.4194 (2.3634)    Prec@1 50.000 (47.652)  Prec@5 71.875 (72.359)
Epoch: [20][630/5005]   Time 0.298 (0.788)      Data 0.001 (0.514)      Loss 2.3440 (2.3624)    Prec@1 47.266 (47.674)  Prec@5 69.922 (72.368)
Epoch: [20][640/5005]   Time 1.156 (0.782)      Data 0.841 (0.507)      Loss 2.5047 (2.3640)    Prec@1 46.094 (47.629)  Prec@5 69.531 (72.345)
Epoch: [20][650/5005]   Time 0.230 (0.787)      Data 0.002 (0.513)      Loss 2.4881 (2.3637)    Prec@1 46.484 (47.629)  Prec@5 73.438 (72.354)
Epoch: [20][660/5005]   Time 0.733 (0.780)      Data 0.385 (0.506)      Loss 2.3043 (2.3642)    Prec@1 48.828 (47.620)  Prec@5 74.219 (72.355)
Epoch: [20][670/5005]   Time 0.222 (0.791)      Data 0.001 (0.517)      Loss 2.4218 (2.3640)    Prec@1 50.000 (47.635)  Prec@5 70.312 (72.358)
Epoch: [20][680/5005]   Time 0.726 (0.784)      Data 0.497 (0.510)      Loss 2.0819 (2.3638)    Prec@1 53.906 (47.653)  Prec@5 75.391 (72.349)
Epoch: [20][690/5005]   Time 0.224 (0.795)      Data 0.002 (0.521)      Loss 2.2428 (2.3634)    Prec@1 49.219 (47.669)  Prec@5 75.000 (72.358)
Epoch: [20][700/5005]   Time 0.278 (0.787)      Data 0.003 (0.513)      Loss 2.4094 (2.3639)    Prec@1 44.141 (47.653)  Prec@5 70.312 (72.346)
Epoch: [20][710/5005]   Time 0.436 (0.798)      Data 0.003 (0.523)      Loss 2.3120 (2.3633)    Prec@1 50.000 (47.665)  Prec@5 71.484 (72.351)
Epoch: [20][720/5005]   Time 0.234 (0.790)      Data 0.001 (0.516)      Loss 2.5496 (2.3646)    Prec@1 44.922 (47.650)  Prec@5 69.141 (72.336)
Epoch: [20][730/5005]   Time 0.232 (0.800)      Data 0.001 (0.526)      Loss 2.1596 (2.3641)    Prec@1 51.953 (47.666)  Prec@5 76.562 (72.350)
Epoch: [20][740/5005]   Time 0.226 (0.793)      Data 0.001 (0.519)      Loss 2.4315 (2.3641)    Prec@1 45.703 (47.657)  Prec@5 71.094 (72.357)
Epoch: [20][750/5005]   Time 0.244 (0.803)      Data 0.001 (0.529)      Loss 2.2962 (2.3637)    Prec@1 45.703 (47.650)  Prec@5 72.266 (72.376)
Epoch: [20][760/5005]   Time 0.316 (0.796)      Data 0.001 (0.522)      Loss 2.4111 (2.3642)    Prec@1 50.781 (47.631)  Prec@5 72.656 (72.366)
Epoch: [20][770/5005]   Time 0.245 (0.802)      Data 0.001 (0.529)      Loss 2.4344 (2.3643)    Prec@1 48.828 (47.611)  Prec@5 71.875 (72.360)
Epoch: [20][780/5005]   Time 0.346 (0.795)      Data 0.001 (0.522)      Loss 2.3858 (2.3640)    Prec@1 45.703 (47.617)  Prec@5 71.094 (72.362)
Epoch: [20][790/5005]   Time 0.290 (0.802)      Data 0.002 (0.529)      Loss 2.5051 (2.3643)    Prec@1 44.922 (47.622)  Prec@5 72.656 (72.356)
Epoch: [20][800/5005]   Time 0.224 (0.795)      Data 0.001 (0.522)      Loss 2.2296 (2.3641)    Prec@1 48.047 (47.624)  Prec@5 74.219 (72.347)
Epoch: [20][810/5005]   Time 0.239 (0.800)      Data 0.002 (0.527)      Loss 2.2256 (2.3643)    Prec@1 49.609 (47.622)  Prec@5 74.219 (72.345)

My pin_memory is set to True, and my data loaders are configured as follows:

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
        num_workers=args.workers, pin_memory=True, sampler=train_sampler)

    val_loader = torch.utils.data.DataLoader(
        datasets.ImageFolder(valdir, transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ])),
        batch_size=args.batch_size, shuffle=False,
        num_workers=args.workers, pin_memory=True)
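
For completeness, the training loop moves each batch to the GPU roughly like this (a simplified sketch, not a copy of main.py; the variable names are mine, and older PyTorch versions spell the flag async=True instead of non_blocking=True). The non_blocking copies are what pin_memory=True is meant to speed up:

    for images, target in train_loader:
        # pinned host memory + non_blocking=True lets the copy overlap with compute
        images = images.cuda(non_blocking=True)
        target = target.cuda(non_blocking=True)

        output = model(images)
        loss = criterion(output, target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()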

The only change I made was adding my own resnet101 module for training, without touching any other part.
@ptrblck, I saw you had some suggestions in “GPU: high memory usage, low GPU volatile-util”, but I don’t know how to investigate this further. Could you please help? Thanks!

What if you remove your custom resnet101?

Thanks for your response. I removed my custom code and ran the original code with the default resnet18 training; GPU usage is still very low:

Epoch: [0][0/5005]      Time 279.412 (279.412)  Data 8.915 (8.915)      Loss 7.0321 (7.0321)    Prec@1 0.000 (0.000)    Prec@5 0.391 (0.391)
Epoch: [0][10/5005]     Time 0.112 (26.322)     Data 0.001 (1.583)      Loss 7.0418 (7.0443)    Prec@1 0.000 (0.036)    Prec@5 0.781 (0.426)
Epoch: [0][20/5005]     Time 0.153 (14.640)     Data 0.087 (1.647)      Loss 7.1721 (7.0928)    Prec@1 0.000 (0.056)    Prec@5 0.391 (0.558)
Epoch: [0][30/5005]     Time 0.784 (10.761)     Data 0.723 (1.932)      Loss 6.9021 (7.0816)    Prec@1 0.391 (0.076)    Prec@5 0.781 (0.655)
Epoch: [0][40/5005]     Time 0.111 (8.569)      Data 0.001 (1.873)      Loss 6.9479 (7.0614)    Prec@1 0.391 (0.114)    Prec@5 0.781 (0.696)
Epoch: [0][50/5005]     Time 0.880 (7.415)      Data 0.813 (2.015)      Loss 6.8931 (7.0361)    Prec@1 0.000 (0.130)    Prec@5 0.391 (0.781)
Epoch: [0][60/5005]     Time 0.104 (6.499)      Data 0.001 (1.970)      Loss 6.8358 (7.0106)    Prec@1 0.000 (0.166)    Prec@5 1.172 (0.890)
Epoch: [0][70/5005]     Time 2.886 (5.963)      Data 2.786 (2.059)      Loss 6.8376 (6.9882)    Prec@1 0.000 (0.165)    Prec@5 1.172 (0.935)
Epoch: [0][80/5005]     Time 0.105 (5.458)      Data 0.001 (2.026)      Loss 6.7916 (6.9656)    Prec@1 0.000 (0.183)    Prec@5 1.562 (1.008)
Epoch: [0][90/5005]     Time 4.114 (5.140)      Data 4.039 (2.075)      Loss 6.7940 (6.9472)    Prec@1 0.000 (0.219)    Prec@5 1.172 (1.039)
Epoch: [0][100/5005]    Time 0.113 (4.808)      Data 0.001 (2.038)      Loss 6.7005 (6.9283)    Prec@1 0.000 (0.251)    Prec@5 0.781 (1.114)
Epoch: [0][110/5005]    Time 6.494 (4.620)      Data 6.420 (2.092)      Loss 6.7363 (6.9102)    Prec@1 0.391 (0.289)    Prec@5 1.562 (1.228)
Epoch: [0][120/5005]    Time 0.104 (4.387)      Data 0.001 (2.060)      Loss 6.7741 (6.8942)    Prec@1 0.391 (0.313)    Prec@5 1.562 (1.311)
Epoch: [0][130/5005]    Time 6.324 (4.253)      Data 6.260 (2.097)      Loss 6.6735 (6.8780)    Prec@1 1.172 (0.331)    Prec@5 2.734 (1.378)
Epoch: [0][140/5005]    Time 0.104 (4.076)      Data 0.001 (2.067)      Loss 6.5866 (6.8644)    Prec@1 0.391 (0.341)    Prec@5 3.125 (1.438)

Actually, I’m also not sure what the printed line means. In Time 0.104 (6.499)    Data 0.001 (1.970), what are the units, seconds?

Buy an SSD.

Hard drives (spinning magnetic disks) don’t do well with small random reads. Data loading is going to be a bottleneck.
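
One quick way to confirm this (a rough sketch, using the train_loader from your post) is to time the loader on its own, with no GPU work at all. If the per-batch time here is already close to the per-iteration Time in your log, the disk/JPEG-decode pipeline is the limit:

    import time

    n_batches = 100
    start = time.time()
    for i, (images, target) in enumerate(train_loader):
        # pull batches only; no model, no GPU work
        if i + 1 == n_batches:
            break
    print('loader only: %.3f s/batch' % ((time.time() - start) / n_batches))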


I see. How can I estimate the total training time? I’m still confused by the printed times: what is the total time for each print, should I add Time and Data together, and what is the unit? Thank you!
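
In case it helps, here is the rough estimate I am making, under the assumption that Time is seconds per iteration and the value in parentheses is a running average (please correct me if that is wrong):

    iters_per_epoch = 5005
    avg_time_per_iter = 0.8          # running-average "Time" from my 20-worker log, assumed seconds
    epochs = 90                      # the example's default

    seconds_per_epoch = iters_per_epoch * avg_time_per_iter       # ~4000 s, about 67 min
    total_hours = seconds_per_epoch * epochs / 3600.0              # ~100 hours
    print('%.0f min/epoch, %.0f hours total' % (seconds_per_epoch / 60, total_hours))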

And there is always this warning:

/home/user/anaconda2/envs/pytorch/lib/python2.7/site-packages/PIL/TiffImagePlugin.py:747: UserWarning: Possibly corrupt EXIF data. Expecting to read 2555904 bytes but only got 0. Skipping tag 0

Do I have to take care of this?

Ignore that warning. We saw that on our copy of ImageNet too, so it’s probably ImageNet’s problem.
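
If the message clutters your log, you can also just filter it (a small sketch; with fork-based DataLoader workers the filter is usually inherited, but that is not guaranteed):

    import warnings

    # silence only the PIL EXIF warning quoted above; other warnings still show
    warnings.filterwarnings("ignore", message="Possibly corrupt EXIF data")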

@twangnh, @SimonW, @colesbury
My GPU speed is badly affected by using the following custom dropout layers. Can anyone tell me how they can be improved?

import torch
import torch.nn as nn
from torch.autograd import Variable

class GaussianDropout(nn.Module):
    def __init__(self, alpha=1.0):
        super(GaussianDropout, self).__init__()
        self.alpha = torch.Tensor([alpha])
        
    def forward(self, x):
        """
        Sample noise   e ~ N(1, alpha)
        Multiply noise h = h_ * e
        """
        if self.training:  # was `if self.train():`, which is always truthy and flips the module back to train mode
            # N(1, alpha); alpha is used directly as the noise scale here
            epsilon = torch.randn(x.size()) * self.alpha + 1

            # note: the noise is created on the CPU, so this copy happens on every forward pass
            epsilon = Variable(epsilon)
            if x.is_cuda:
                epsilon = epsilon.cuda()

            return x * epsilon
        else:
            return x

class VariationalDropout(nn.Module):
    def __init__(self, alpha=1.0, dim=None):
        super(VariationalDropout, self).__init__()
        
        self.dim = dim
        self.max_alpha = alpha
        # Initial alpha
        log_alpha = (torch.ones(dim) * alpha).log()
        self.log_alpha = nn.Parameter(log_alpha)
        
    def kl(self):
        c1 = 1.16145124
        c2 = -1.50204118
        c3 = 0.58629921
        
        alpha = self.log_alpha.exp()
        
        negative_kl = 0.5 * self.log_alpha + c1 * alpha + c2 * alpha**2 + c3 * alpha**3
        
        kl = -negative_kl
        
        return kl.mean()
    
    def forward(self, x):
        """
        Sample noise   e ~ N(1, alpha)
        Multiply noise h = h_ * e
        """
        if self.training:  # was `if self.train():`, which is always truthy
            # N(0, 1), created on the CPU and copied to the GPU on every call
            epsilon = Variable(torch.randn(x.size()))
            if x.is_cuda:
                epsilon = epsilon.cuda()

            # Clip alpha
            self.log_alpha.data = torch.clamp(self.log_alpha.data, max=self.max_alpha)
            alpha = self.log_alpha.exp()

            # N(1, alpha): shift the scaled noise to mean 1 so the expected activation
            # is preserved (the original returned zero-mean noise, contradicting the docstring)
            epsilon = epsilon * alpha + 1

            return x * epsilon
        else:
            return x

def dropout(p=None, dim=None, method='standard'):
    if method == 'standard':
        return nn.Dropout(p)
    elif method == 'gaussian':
        return GaussianDropout(p/(1-p))
    elif method == 'variational':
        return VariationalDropout(p/(1-p), dim)

If I use the built-in PyTorch dropout, nn.Dropout(p=dropout_rate), then GPU utilization is nearly 98%.

I have also observed that the above code works fine with a dense (fully connected) network. However, in a CNN architecture such as WideResNet, replacing nn.Dropout(p=dropout_rate) (line 27) with Gaussian or variational dropout reduces the overall GPU utilization.
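
My current guess is that the slowdown comes from the noise being created on the CPU (torch.randn(x.size())) and then copied to the GPU on every forward pass. Here is a sketch of a Gaussian-dropout variant that samples directly on the input's device instead; the class name is mine, it is untested, and it assumes a PyTorch version that has torch.randn_like:

    import torch
    import torch.nn as nn

    class GaussianDropoutOnDevice(nn.Module):
        """Multiplicative Gaussian noise sampled on the same device as the input."""
        def __init__(self, alpha=1.0):
            super(GaussianDropoutOnDevice, self).__init__()
            self.alpha = alpha

        def forward(self, x):
            if self.training:
                # randn_like creates the noise with x's size, dtype and device,
                # so there is no per-iteration CPU allocation or host-to-device copy
                noise = torch.randn_like(x) * self.alpha + 1.0
                return x * noise
            return x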

Hi there, have you solved the problem?