GPU: high memory usage, low GPU volatile-util

Hello!
I am running experiments, but they are extremely slow.

GPU memory usage is 8817MiB / 12189MiB, but Volatile GPU-Util is usually 1-4% and only rarely reaches 80-100%.

How can I fix this (except changing batch size)?

Thank you!

You probably have a bottleneck somewhere, so your GPU is starving.
I assume you are using a DataLoader. Could you increase num_workers?
Are you using pin_memory=True? Is your data on an SSD?

Have a look at this line of code from the ImageNet example to check whether your DataLoader is the reason.

Alternatively, you can have a look at torch.utils.bottleneck for further debugging.
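
A minimal sketch of that kind of timing, written in the spirit of the ImageNet example rather than copied from it. `AverageMeter`, `timed_epoch`, `fake_loader`, and `fake_step` are hypothetical names used only for illustration; in real code you would pass your own DataLoader and training step. It needs only the standard library:

```python
import time


class AverageMeter:
    """Keeps the most recent value and the running average."""

    def __init__(self):
        self.val = 0.0
        self.total = 0.0
        self.count = 0

    def update(self, val):
        self.val = val
        self.total += val
        self.count += 1

    @property
    def avg(self):
        return self.total / self.count if self.count else 0.0


def timed_epoch(loader, train_step, print_freq=100):
    """Run one pass over `loader`, timing the data wait and the full iteration."""
    data_time, batch_time = AverageMeter(), AverageMeter()
    end = time.perf_counter()
    for i, batch in enumerate(loader):
        # Time spent waiting for the loader to hand over the next batch.
        data_time.update(time.perf_counter() - end)
        train_step(batch)
        # On a GPU, call torch.cuda.synchronize() before reading the clock,
        # because CUDA kernels are launched asynchronously.
        batch_time.update(time.perf_counter() - end)
        if i % print_freq == 0:
            print(f"[{i}] Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t"
                  f"Data {data_time.val:.3f} ({data_time.avg:.3f})")
        end = time.perf_counter()
    return data_time, batch_time


# Stand-ins (assumptions) so the sketch runs without a dataset or a GPU.
def fake_loader(n_batches=5, delay=0.002):
    for i in range(n_batches):
        time.sleep(delay)  # pretend to read and decode a batch
        yield i


def fake_step(batch, delay=0.01):
    time.sleep(delay)  # pretend to run forward/backward


data_time, batch_time = timed_epoch(fake_loader(), fake_step, print_freq=2)
```

If Data is close to Time, the loader is the bottleneck. If Data is near zero while Time stays large, the time is going somewhere else in the loop.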

Thanks for your reply.
I set num_workers = 16 (I increased it from 8, but it didn’t change much). pin_memory was False by default. And yes, my data is on an SSD.

Should I change pin_memory to True?

Yes, if your Dataset returns CPU tensors that you push to the GPU later.
pin_memory=True makes the DataLoader use page-locked memory, which speeds up the host-to-device transfer.
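
A minimal sketch of that pattern, assuming random tensors as a stand-in for a real dataset (the shapes and batch size are made up). Note that `non_blocking=True` only makes the copy asynchronous when the source tensor is in pinned memory:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Random CPU tensors as placeholder data (an assumption, not a real dataset).
dataset = TensorDataset(torch.randn(256, 3, 32, 32),
                        torch.randint(0, 10, (256,)))

# pin_memory makes the loader return batches in page-locked host memory.
# It is only useful when a GPU is present, so it is switched off otherwise.
loader = DataLoader(dataset, batch_size=32, shuffle=True, pin_memory=use_cuda)

for images, labels in loader:
    # With a pinned source, these copies can overlap with other host work.
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward / backward pass would go here ...
```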

The GPU volatile-util still varies from 2 to 4%. Here is my dataloader. (opt.batchSize = 32)

train_set = datasets.VCTK(root = '/home/.../dataset/dataset_training', download = False, transform = transforms.PadTrim(max_len=30000))
training_data_loader = DataLoader(dataset = train_set,
                                  pin_memory=True, num_workers=8, batch_size=opt.batchSize,shuffle=True)

Are there any suggestions to fix it?

Have you timed your data loading as shown in the ImageNet example? If so, what time did you get?
Having this number gives us an idea about the bottleneck of your code.

I timed my data loading as shown in the ImageNet example. Here is the beginning of training:

===> Epoch[1](0/510): Loss: 
Time 11.203 (11.203)	Data 7.408 (7.408)	
===> Epoch[1](100/510):
Time 3.530 (3.628)	Data 0.000 (0.077)	
===> Epoch[1](200/510):
Time 3.325 (3.578)	Data 0.001 (0.040)
===> Epoch[1](300/510): 
Time 3.883 (3.555)	Data 0.000 (0.027)	
===> Epoch[1](400/510): 
Time 4.016 (3.555)	Data 0.000 (0.021)	
....

Your DataLoader seems to be fast enough to provide the samples.
Could you post or explain what your code is doing besides the training procedure?
Do you have any post-processing that might slow everything down?

Definitely check whether your code is blocked by something else, e.g. plotting.

Meanwhile, how much utilization do you get if you run simpler code? Maybe try something like VGG with CIFAR10. Let’s first confirm that your install works fine.
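
For an even quicker sanity check than a full VGG/CIFAR10 run, here is a minimal sketch that simply keeps the device busy with matrix multiplications while you watch the utilization. This is a stand-in workload, not real training, and the matrix size and iteration count are arbitrary assumptions:

```python
import time

import torch

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Larger problem on a GPU so it runs long enough to observe; tiny on CPU.
n = 4096 if use_cuda else 256
iters = 200 if use_cuda else 10

a = torch.randn(n, n, device=device)
b = torch.randn(n, n, device=device)

if use_cuda:
    torch.cuda.synchronize()  # make sure setup work has finished
start = time.perf_counter()
for _ in range(iters):
    c = a @ b
if use_cuda:
    torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock
elapsed = time.perf_counter() - start

print(f"{iters} matmuls of {n}x{n} on {device}: {elapsed:.3f} s")
```

If utilization stays near zero even here, the problem is more likely the install or the way utilization is being measured than your training code.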

I simplified the code and got rid of unnecessary things, and it worked fine! Thank you!

Hi,

I am encountering a similar problem, although the data loading time seems normal.

... ...

Epoch: [2][20/296],
Learning_Rate: 0.000492,
Time: 0.9096,       Data:     0.0799,
MIoU: 0.5842,       Accuracy: 0.9174,      Loss: 0.250443
Epoch: [2][40/296],
Learning_Rate: 0.000492,
Time: 0.8930,       Data:     0.0611,
MIoU: 0.5833,       Accuracy: 0.9198,      Loss: 0.249228
Epoch: [2][60/296],
Learning_Rate: 0.000492,
Time: 0.8872,       Data:     0.0547,
MIoU: 0.5920,       Accuracy: 0.9183,      Loss: 0.246295
... ...

I set num_workers=2 because increasing it by powers of 2 made no difference. However, Volatile GPU-Util is not stable: it swings between 0% and 99%. What could the problem be?

Thank you.

Hi, could you share which parts of the code you simplified? Checkpoint saving? Plotting? Visualization?
I once read that NumPy arrays could be the issue; maybe somewhere I have a variable that the GPU doesn’t support and that has to be handled on the CPU.

Hi @ptrblck, my GeForce GTX 1060 on Windows 10 is also at 0-1% utilization, while memory usage is constantly high. I have tried a lot of fixes from different forums, but nothing has helped.

Here’s my code.

Any ideas?


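# Imports inferred from the aliases used below (not part of the original post).
import numpy as np
import torch as pt
import torch.nn as nn
import torchvision as tv

crop_size = (100, 100)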

transforms = tv.transforms.Compose([
                                    #tv.transforms.transforms.CenterCrop(crop_size),
                                    #tv.transforms.transforms.RandomAffine(5),
                                    # tv.transforms.transforms.RandomHorizontalFlip(),
                                    # tv.transforms.transforms.RandomVerticalFlip(),
                                    tv.transforms.transforms.Resize(crop_size),
                                    #tv.transforms.transforms.RandomRotation(20),
                                    #tv.transforms.transforms.RandomCrop(crop_size),
                                    tv.transforms.transforms.ToTensor(),
                                   ])

batch_size = 128

train_folders = tv.datasets.ImageFolder('./train', transform=transforms)
train_loader = pt.utils.data.DataLoader(train_folders, 
                                        batch_size=batch_size, 
                                        shuffle=True, 
                                        #num_workers=2, 
                                        pin_memory=True)
print('Batches info: {}'.format(next(iter(train_loader))[0].shape))
print('Expected # of batches for epoch {}'.format(int(len(train_folders)/batch_size)))

valid_folders = tv.datasets.ImageFolder('./valid', transform=transforms)
valid_loader = pt.utils.data.DataLoader(valid_folders, 
                                        batch_size=batch_size, 
                                        shuffle=True, 
                                        #num_workers=2, 
                                        pin_memory=True)

test_folders = tv.datasets.ImageFolder('./test', transform=transforms)
test_loader = pt.utils.data.DataLoader(test_folders, 
                                       batch_size=batch_size, 
                                       shuffle=True,
                                       #num_workers=2, 
                                       pin_memory=True)

input_shape = next(iter(train_loader))[0].shape # for connecting convolutional outputs to linear

# CONV model

pt.cuda.is_available(), pt.cuda.current_device(), pt.cuda.get_device_name()
device = 'cuda' if pt.cuda.is_available() else 'cpu'
#device = 'cpu'

class Convolve2D(nn.Module):
    def __init__(self, input_shape, in_channels, out_channels, kernel_size, maxp_kernel_size, output_dim):
        super().__init__()
        
        self.conv = nn.Sequential(nn.Conv2d(in_channels, out_channels, kernel_size),
                                  nn.Dropout(p=0.26),
                                  nn.SELU(),
                                  nn.AvgPool2d(kernel_size=maxp_kernel_size, stride=1),
                                  
                                  nn.BatchNorm2d(num_features=out_channels),
        
                                  nn.Conv2d(out_channels, out_channels, kernel_size),
                                  nn.Dropout(p=0.26),
                                  nn.SELU(),
                                  nn.AvgPool2d(kernel_size=maxp_kernel_size, stride=1),
                                  
                                  nn.BatchNorm2d(num_features=out_channels),
                                  
                                  nn.Conv2d(out_channels, out_channels, kernel_size),
                                  nn.Dropout(p=0.26),
                                  nn.SELU(),
                                  nn.AvgPool2d(kernel_size=maxp_kernel_size, stride=1),                                  
                                  
                                  nn.BatchNorm2d(num_features=out_channels),
                                  
                                  nn.Conv2d(out_channels, out_channels, kernel_size),
                                  nn.Dropout(p=0.26),
                                  nn.SELU(),
                                  nn.AvgPool2d(kernel_size=maxp_kernel_size, stride=1),                                  
                                  
                                  nn.BatchNorm2d(num_features=out_channels),

                                  nn.Conv2d(out_channels, out_channels, kernel_size),
                                  nn.Dropout(p=0.26),
                                  nn.SELU(),
                                  nn.AvgPool2d(kernel_size=maxp_kernel_size, stride=1),  
                                  
                                  nn.BatchNorm2d(num_features=out_channels),

                                  nn.Conv2d(out_channels, out_channels, kernel_size),
                                  nn.Dropout(p=0.26),
                                  nn.SELU(),
                                  nn.AvgPool2d(kernel_size=maxp_kernel_size),
                                  
                                  nn.BatchNorm2d(num_features=out_channels),

                                  nn.Conv2d(out_channels, out_channels, kernel_size),
                                  nn.Dropout(p=0.26),
                                  nn.SELU(),
                                  nn.AvgPool2d(kernel_size=maxp_kernel_size),
                                                                    
                                  nn.Flatten())
        
        self.fc = nn.Sequential(nn.Linear(in_features=self.conv(pt.zeros(*input_shape)).shape[1], 
                                          out_features=output_dim),
                                nn.LogSoftmax(1))
        
    def forward(self, x):
        #print('x.shape', x.shape)
        conv = self.conv(x)
        out = self.fc(conv)
        #print('out.shape', out.shape)
        return out
    
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

input_shape = next(iter(train_loader))[0].shape # for connecting convolutional outputs to linear
conv2d_hp = dict(input_shape=input_shape, 
                   in_channels=3, 
                   out_channels=5, 
                   kernel_size=3,
                   maxp_kernel_size=3,
                   output_dim=12)
conv2d = Convolve2D(**conv2d_hp).to(device)
criterion = nn.NLLLoss(reduction='none') # per-sample losses; loss.mean() below reproduces reduction='mean'
LR = 0.005
optimizer = pt.optim.Adam(conv2d.parameters(), lr=LR)
# print(conv2d)
print('Number of parameters: {:,}'.format(count_parameters(conv2d)))


for epoch in range(3):
    conv2d.train()
    running_loss = []
    #samples = 0
    for i, (image, label) in enumerate(train_loader):
        image = image.to(device, non_blocking=True)
        label = label.to(device, non_blocking=True)
        
        out = conv2d(image)
        loss = criterion(out, label)
        
        conv2d.zero_grad()
        loss.mean().backward()
        optimizer.step()
        
        running_loss.append(loss.mean().item())
        #samples += len(image)
        print('Epoch {}, Step: {}, Loss: {:.4f}'.format(epoch+1, i+1, loss.mean().item()))
    print('Epoch {}, Loss: {:.4f}'.format(epoch+1, np.mean(running_loss)))
    with pt.no_grad():
        conv2d.eval()
        correct = 0
        total = 0
        for image, label in test_loader:
            image = image.to(device)
            label = label.to(device)
            out = conv2d(image)
            #top_p, top_class = out.topk(1, dim=1)
            #equals = top_class == label.view(*top_class.shape)
            #accuracy = pt.mean(equals.type(pt.float))
            predicted = pt.max(out.data, 1)
            total += label.size(0)
            correct += (predicted[1] == label).sum().item()
        print('Test Accuracy on {} images: {:.2f} %'.format(total, 100 * correct / total))
print('Saving model...')
pt.save(conv2d.state_dict(), 'conv2d_checkpoint.pth')
print('Model saved!')

How are you measuring the GPU utilization?
Could you use nvidia-smi for it?
I’m not using Windows, but I have seen some posts here showing that the Task Manager (in case you use it) lets you select a “compute” option, while a different option seems to be active by default.

I was looking at the Task Manager’s GPU %. Haven’t seen the compute option.

Have a look at this post. I can’t verify it, as I’m not using Windows.
However, I assume you should also have the nvidia-smi utility to check the actual GPU usage.
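
A minimal sketch of reading the numbers from nvidia-smi in a script, assuming the executable is on your PATH (on Windows it may not be, in which case this returns None). `parse_smi_line` and `query_gpus` are hypothetical helper names; the query fields are the standard `utilization.gpu`, `memory.used`, and `memory.total`:

```python
import shutil
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_smi_line(line):
    """Turn '3, 8817, 12189' into (3, 8817, 12189); non-numeric fields become None."""
    fields = []
    for field in line.split(","):
        field = field.strip()
        fields.append(int(field) if field.isdigit() else None)
    return tuple(fields)


def query_gpus():
    """Return one (util %, used MiB, total MiB) tuple per GPU.

    Returns None if nvidia-smi is not on PATH or exits with an error.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    proc = subprocess.run(
        [exe, f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        return None
    return [parse_smi_line(line) for line in proc.stdout.splitlines() if line.strip()]


gpu_stats = query_gpus()
print(gpu_stats)
```

Calling `query_gpus()` in a loop with a short sleep while training runs in another process gives a utilization trace that does not depend on the Task Manager view.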

Hello ptrblck,
thanks for your hard work!
I moved my ImageNet directory to an SSD and it is super fast…
One question, if I may…
My local computer (two Titan Xps) suffers less from the HDD speed bottleneck,
whereas my server with eight (or twenty) RTX 2080 Ti GPUs suffers a lot from the HDD…
Could there be a reason for this?

I also heard from coworkers that num_workers <= 8 is always more than enough, because setting more workers can introduce another bottleneck when distributing the loader’s work. Is this also true?

Your server might need to load much more data to feed the GPUs, so the data loading bottleneck might be more serious in this case. I would generally assume that an HDD could create a bottleneck even in a single-GPU setup.

I’m not sure you can define a hard threshold such as 8 workers, since it depends on the system you are using. Your coworkers are right on one point, though: too many workers might degrade the performance again, so you would have to find the “sweet spot”.
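
A minimal sketch of searching for that sweet spot by timing one pass over the data for several worker counts. `time_one_epoch` and `sweep_workers` are hypothetical names, and the random in-memory tensors are only a placeholder: they will not benefit from extra workers the way disk-backed images do, so run the sweep on your real Dataset.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_one_epoch(dataset, num_workers, batch_size=64):
    """Iterate once over the dataset and return the elapsed seconds."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                        num_workers=num_workers)
    start = time.perf_counter()
    for _ in loader:
        pass  # only the loading cost is measured, no training
    # The result includes worker start-up time, which is part of the real cost.
    return time.perf_counter() - start


def sweep_workers(dataset, worker_counts, batch_size=64):
    """Map each worker count to the time needed for one pass."""
    return {n: time_one_epoch(dataset, n, batch_size) for n in worker_counts}


if __name__ == "__main__":
    # Placeholder data (assumption); replace with your own Dataset.
    dataset = TensorDataset(torch.randn(2048, 3, 32, 32),
                            torch.randint(0, 10, (2048,)))
    for n, seconds in sweep_workers(dataset, [0, 2, 4, 8]).items():
        print(f"num_workers={n}: {seconds:.3f} s")
```

The `if __name__ == "__main__":` guard matters on platforms that start worker processes by spawning (such as Windows), where the main module is re-imported in every worker.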

Hi ptrblck, thank you for the good comments.
I wonder how much time is enough.
I am doing an image regression task with 1024x512 images, resized to 400x200.
I get a data loading time of about 0.016 (0.115).
GPU utilization is almost 0%, rarely 40%, and I have already tried everything you mentioned in this thread (SSD, num_workers, pin_memory, etc.).
Any suggestions?

For anyone who uses TensorBoard’s writer.add_image: that was the problem.
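
One way to keep image logging without paying for it on every step is to log only every N steps. A minimal sketch under that assumption: `CountingWriter` is a stand-in for `torch.utils.tensorboard.SummaryWriter` so the example runs without TensorBoard installed, and `log_image_every` and the interval of 500 are made up for illustration.

```python
class CountingWriter:
    """Stand-in for a TensorBoard SummaryWriter that only records calls (assumption)."""

    def __init__(self):
        self.calls = []

    def add_image(self, tag, img_tensor, global_step=None):
        self.calls.append((tag, global_step))


def log_image_every(writer, tag, image, step, every=500):
    """Call writer.add_image only on every `every`-th step; return True if logged."""
    if step % every != 0:
        return False
    writer.add_image(tag, image, global_step=step)
    return True


writer = CountingWriter()
for step in range(2000):
    image = None  # placeholder for a real image tensor
    log_image_every(writer, "train/input", image, step, every=500)

print(f"add_image was called {len(writer.calls)} times in 2000 steps")
```

With a real SummaryWriter the same helper works unchanged, since it only relies on `add_image(tag, img_tensor, global_step=...)`.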
