GPU: high memory usage, low GPU volatile-util

YK11 · June 18, 2018, 8:45am

Hello!
I am running experiments, but they are extremely slow.

GPU memory usage is 8817MiB / 12189MiB, but Volatile GPU-Util is usually 1-4% and only rarely reaches 80-100%.

How can I fix this (other than by changing the batch size)?

Thank you!

ptrblck · June 18, 2018, 9:03am

You probably have a bottleneck somewhere that is starving your GPU.
I assume you are using a DataLoader. Could you increase num_workers?
Are you using pin_memory=True? Is your data on an SSD?

Have a look at this line of code from the ImageNet example to check whether your DataLoader is the cause.

Alternatively, you can have a look at torch.utils.bottleneck for further debugging.
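
A minimal sketch of that timing pattern, written against a plain Python iterable so it runs without a GPU. The names `timed_epoch`, `fake_loader`, and `fake_step` are hypothetical; with a real DataLoader you would pass the loader itself and a function that runs one training step. CUDA kernels run asynchronously, so with a real model the step time is only meaningful if the step synchronizes, for example by reading the loss with `.item()`.

```python
import time

def timed_epoch(loader, step_fn):
    """Return (avg_data_time, avg_iter_time) in seconds per iteration."""
    data_time = 0.0
    iter_time = 0.0
    n = 0
    end = time.perf_counter()
    for batch in loader:
        # Time spent waiting for the loader to hand over the next batch.
        data_time += time.perf_counter() - end
        step_fn(batch)
        # Full iteration time: data wait plus the training step.
        iter_time += time.perf_counter() - end
        n += 1
        end = time.perf_counter()
    return data_time / max(n, 1), iter_time / max(n, 1)

# Stand-ins for a real DataLoader and training step (assumptions for the demo).
def fake_loader(num_batches=5, load_seconds=0.02):
    for i in range(num_batches):
        time.sleep(load_seconds)  # pretend to read and decode a batch
        yield i

def fake_step(batch, compute_seconds=0.005):
    time.sleep(compute_seconds)  # pretend to run forward/backward

avg_data, avg_iter = timed_epoch(fake_loader(), fake_step)
print(f"Data {avg_data:.3f}s  Time {avg_iter:.3f}s  "
      f"({100 * avg_data / avg_iter:.0f}% of each iteration spent waiting for data)")
```

If the data time is a large share of the iteration time, the loader is the likely bottleneck; if it is close to zero, the slowdown is probably somewhere else.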

YK11 · June 18, 2018, 10:01am

Thanks for your reply.
I set num_workers = 16 (I increased it from 8, but it didn’t change much). pin_memory was False by default. And yes, my data is on an SSD.

Should I change pin_memory to True?

ptrblck · June 18, 2018, 11:05am

Yes, if you are loading your data in the Dataset as CPU tensors and pushing it to the GPU later.
It will use page-locked memory and speed up the host-to-device transfer.
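
A minimal sketch of that pattern, assuming a toy in-memory `TensorDataset` in place of a real dataset. Pinned memory only matters when a CUDA device is present, so it is enabled conditionally here and the code falls back to the CPU otherwise.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Toy stand-in for a real Dataset that returns CPU tensors.
dataset = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, 10, (64,)))

# pin_memory=True makes the loader return batches in page-locked host memory.
loader = DataLoader(dataset, batch_size=16, shuffle=True, pin_memory=use_cuda)

for images, labels in loader:
    # non_blocking=True lets the copy overlap with other work, but only when
    # the source tensor is pinned; otherwise it behaves like a normal copy.
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)

print(images.shape, images.device)
```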

YK11 · June 18, 2018, 4:11pm

The GPU volatile-util still varies from 2 to 4%. Here is my dataloader. (opt.batchSize = 32)

train_set = datasets.VCTK(root = '/home/.../dataset/dataset_training', download = False, transform = transforms.PadTrim(max_len=30000))
training_data_loader = DataLoader(dataset = train_set,
                                  pin_memory=True, num_workers=8, batch_size=opt.batchSize,shuffle=True)

Are there any suggestions to fix it?

ptrblck · June 18, 2018, 4:22pm

Have you timed your data loading as shown in the ImageNet example? If so, what time did you get?
Having this number gives us an idea about the bottleneck of your code.

YK11 · June 18, 2018, 4:41pm

I timed my data loading as shown in the ImageNet example. Here is the beginning of training:

===> Epoch[1](0/510): Loss: 
Time 11.203 (11.203)	Data 7.408 (7.408)	
===> Epoch[1](100/510):
Time 3.530 (3.628)	Data 0.000 (0.077)	
===> Epoch[1](200/510):
Time 3.325 (3.578)	Data 0.001 (0.040)
===> Epoch[1](300/510): 
Time 3.883 (3.555)	Data 0.000 (0.027)	
===> Epoch[1](400/510): 
Time 4.016 (3.555)	Data 0.000 (0.021)	
....

ptrblck · June 18, 2018, 5:59pm

Your DataLoader seems to be fast enough to provide the samples.
Could you post or explain what your code is doing besides the training procedure?
Do you have any post-processing that might slow everything down?
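
To make that reading of the log concrete, here is the arithmetic on the line for iteration 400/510, assuming the log follows the ImageNet example's `current (running average)` format:

```python
# Running averages copied from the log line at iteration 400/510.
avg_iter_time = 3.555   # seconds per iteration ("Time", value in parentheses)
avg_data_time = 0.021   # seconds per iteration spent waiting for data ("Data")

data_fraction = avg_data_time / avg_iter_time
other_time = avg_iter_time - avg_data_time

print(f"Data loading: {100 * data_fraction:.1f}% of each iteration")  # 0.6%
print(f"Everything else: {other_time:.3f}s per iteration")            # 3.534s
```

Under one percent of each iteration is spent waiting for data, so roughly 3.5 seconds per iteration go to something other than the DataLoader.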

SimonW · June 18, 2018, 10:18pm

Definitely check whether your code is blocked by something else, e.g. plotting.

Meanwhile, how much utilization do you get if you run simpler code? Maybe try something like VGG with CIFAR-10. Let’s first confirm that your install works fine.
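
One way to find out what is blocking the loop is to time each part of it separately. The sketch below is a generic helper, not something from the posts above; `section` and the sleep durations are made up for illustration. With CUDA code, kernels are launched asynchronously, so calling `torch.cuda.synchronize()` before reading the clock gives more trustworthy numbers.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

totals = defaultdict(float)

@contextmanager
def section(name):
    """Accumulate wall-clock time spent inside the `with` block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[name] += time.perf_counter() - start

# Stand-ins for real work; the sleep durations are made up for the demo.
for step in range(3):
    with section("train_step"):
        time.sleep(0.005)
    with section("plotting"):
        time.sleep(0.03)

# Print the sections from slowest to fastest.
for name, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:>12}: {seconds:.3f}s")
```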

YK11 · June 18, 2018, 11:43pm

I simplified the code and got rid of unnecessary things, and it worked fine! Thank you!

MariosOreo · January 25, 2019, 10:30am

Hi,

I have also run into a similar problem; the data loading cost seems normal.

... ...

Epoch: [2][20/296],
Learning_Rate: 0.000492,
Time: 0.9096,       Data:     0.0799,
MIoU: 0.5842,       Accuracy: 0.9174,      Loss: 0.250443
Epoch: [2][40/296],
Learning_Rate: 0.000492,
Time: 0.8930,       Data:     0.0611,
MIoU: 0.5833,       Accuracy: 0.9198,      Loss: 0.249228
Epoch: [2][60/296],
Learning_Rate: 0.000492,
Time: 0.8872,       Data:     0.0547,
MIoU: 0.5920,       Accuracy: 0.9183,      Loss: 0.246295
... ...

I set num_workers=2 because increasing it in powers of 2 doesn’t make any difference. But the Volatile GPU-Util is not stable; it fluctuates between 0% and 99%. What could the problem be?

Thank you.

maomao · March 28, 2019, 2:28am

Hi, could you share which parts of the code you simplified? Checkpoint saving? Plotting? Visualization?
I once read that NumPy arrays could be the issue; maybe somewhere I have a variable that the GPU doesn’t support and that has to be processed on the CPU.

Mauricio_Maroto · January 23, 2020, 5:58am

Hi @ptrblck, my GPU (GeForce GTX 1060 on Windows 10) is also at 0-1% utilization, but memory usage is constantly high. I have tried a lot of fixes from different forums, but nothing has helped.

Here’s my code.

Any ideas?


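# Imports implied by the aliases used below (np, pt, nn, tv)
import numpy as np
import torch as pt
import torch.nn as nn
import torchvision as tv

crop_size = (100, 100)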

transforms = tv.transforms.Compose([
                                    #tv.transforms.transforms.CenterCrop(crop_size),
                                    #tv.transforms.transforms.RandomAffine(5),
                                    # tv.transforms.transforms.RandomHorizontalFlip(),
                                    # tv.transforms.transforms.RandomVerticalFlip(),
                                    tv.transforms.transforms.Resize(crop_size),
                                    #tv.transforms.transforms.RandomRotation(20),
                                    #tv.transforms.transforms.RandomCrop(crop_size),
                                    tv.transforms.transforms.ToTensor(),
                                   ])

batch_size = 128

train_folders = tv.datasets.ImageFolder('./train', transform=transforms)
train_loader = pt.utils.data.DataLoader(train_folders, 
                                        batch_size=batch_size, 
                                        shuffle=True, 
                                        #num_workers=2, 
                                        pin_memory=True)
print('Batches info: {}'.format(next(iter(train_loader))[0].shape))
print('Expected # of batches for epoch {}'.format(int(len(train_folders)/batch_size)))

valid_folders = tv.datasets.ImageFolder('./valid', transform=transforms)
valid_loader = pt.utils.data.DataLoader(valid_folders, 
                                        batch_size=batch_size, 
                                        shuffle=True, 
                                        #num_workers=2, 
                                        pin_memory=True)

test_folders = tv.datasets.ImageFolder('./test', transform=transforms)
test_loader = pt.utils.data.DataLoader(test_folders, 
                                       batch_size=batch_size, 
                                       shuffle=True,
                                       #num_workers=2, 
                                       pin_memory=True)

input_shape = next(iter(train_loader))[0].shape # for connecting convolutional outputs to linear

# CONV model

pt.cuda.is_available(), pt.cuda.current_device(), pt.cuda.get_device_name()
device = 'cuda' if pt.cuda.is_available() else 'cpu'
#device = 'cpu'

class Convolve2D(nn.Module):
    def __init__(self, input_shape, in_channels, out_channels, kernel_size, maxp_kernel_size, output_dim):
        super().__init__()
        
        self.conv = nn.Sequential(nn.Conv2d(in_channels, out_channels, kernel_size),
                                  nn.Dropout(p=0.26),
                                  nn.SELU(),
                                  nn.AvgPool2d(kernel_size=maxp_kernel_size, stride=1),
                                  
                                  nn.BatchNorm2d(num_features=out_channels),
        
                                  nn.Conv2d(out_channels, out_channels, kernel_size),
                                  nn.Dropout(p=0.26),
                                  nn.SELU(),
                                  nn.AvgPool2d(kernel_size=maxp_kernel_size, stride=1),
                                  
                                  nn.BatchNorm2d(num_features=out_channels),
                                  
                                  nn.Conv2d(out_channels, out_channels, kernel_size),
                                  nn.Dropout(p=0.26),
                                  nn.SELU(),
                                  nn.AvgPool2d(kernel_size=maxp_kernel_size, stride=1),                                  
                                  
                                  nn.BatchNorm2d(num_features=out_channels),
                                  
                                  nn.Conv2d(out_channels, out_channels, kernel_size),
                                  nn.Dropout(p=0.26),
                                  nn.SELU(),
                                  nn.AvgPool2d(kernel_size=maxp_kernel_size, stride=1),                                  
                                  
                                  nn.BatchNorm2d(num_features=out_channels),

                                  nn.Conv2d(out_channels, out_channels, kernel_size),
                                  nn.Dropout(p=0.26),
                                  nn.SELU(),
                                  nn.AvgPool2d(kernel_size=maxp_kernel_size, stride=1),  
                                  
                                  nn.BatchNorm2d(num_features=out_channels),

                                  nn.Conv2d(out_channels, out_channels, kernel_size),
                                  nn.Dropout(p=0.26),
                                  nn.SELU(),
                                  nn.AvgPool2d(kernel_size=maxp_kernel_size),
                                  
                                  nn.BatchNorm2d(num_features=out_channels),

                                  nn.Conv2d(out_channels, out_channels, kernel_size),
                                  nn.Dropout(p=0.26),
                                  nn.SELU(),
                                  nn.AvgPool2d(kernel_size=maxp_kernel_size),
                                                                    
                                  nn.Flatten())
        
        self.fc = nn.Sequential(nn.Linear(in_features=self.conv(pt.zeros(*input_shape)).shape[1], 
                                          out_features=output_dim),
                                nn.LogSoftmax(1))
        
    def forward(self, x):
        #print('x.shape', x.shape)
        conv = self.conv(x)
        out = self.fc(conv)
        #print('out.shape', out.shape)
        return out
    
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

input_shape = next(iter(train_loader))[0].shape # for connecting convolutional outputs to linear
conv2d_hp = dict(input_shape=input_shape, 
                   in_channels=3, 
                   out_channels=5, 
                   kernel_size=3,
                   maxp_kernel_size=3,
                   output_dim=12)
conv2d = Convolve2D(**conv2d_hp).to(device)
criterion = nn.NLLLoss(reduction='none') # with loss.mean => the behind the scenes of reduction='mean'
LR = 0.005
optimizer = pt.optim.Adam(conv2d.parameters(), lr=LR)
# print(conv2d)
print('Number of parameters: {:,}'.format(count_parameters(conv2d)))


for epoch in range(3):
    conv2d.train()
    running_loss = []
    #samples = 0
    for i, (image, label) in enumerate(train_loader):
        image = image.to(device, non_blocking=True)
        label = label.to(device, non_blocking=True)
        
        out = conv2d(image)
        loss = criterion(out, label)
        
        conv2d.zero_grad()
        loss.mean().backward()
        optimizer.step()
        
        running_loss.append(loss.mean().item())
        #samples += len(image)
        print('Epoch {}, Step: {}, Loss: {:.4f}'.format(epoch+1, i+1, loss.mean().item()))
    print('Epoch {}, Loss: {:.4f}'.format(epoch+1, np.mean(running_loss)))
    with pt.no_grad():
        conv2d.eval()
        correct = 0
        total = 0
        for image, label in test_loader:
            image = image.to(device)
            label = label.to(device)
            out = conv2d(image)
            #top_p, top_class = out.topk(1, dim=1)
            #equals = top_class == label.view(*top_class.shape)
            #accuracy = pt.mean(equals.type(pt.float))
            predicted = pt.max(out.data, 1)
            total += label.size(0)
            correct += (predicted[1] == label).sum().item()
        print('Test Accuracy on {} images: {:.2f} %'.format(total, 100 * correct / total))
print('Saving model...')
pt.save(conv2d.state_dict(), 'conv2d_checkpoint.pth')
print('Model saved!')

ptrblck · January 23, 2020, 6:18am

How are you measuring the GPU utilization?
Could you use nvidia-smi for it?
I’m not using Windows, but I have seen some posts here showing that the Task Manager (in case you use it) lets you select a “compute” option, while a different option seems to be active by default.

Mauricio_Maroto · January 23, 2020, 2:33pm

I was looking at the Task Manager’s GPU %. Haven’t seen the compute option.

ptrblck · January 23, 2020, 11:19pm

Have a look at this post. I can’t verify it, as I’m not using Windows.
However, I assume you should also have the nvidia-smi utility to check the actual GPU usage.
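
A small sketch of reading utilization through nvidia-smi from Python, using its `--query-gpu` and `--format` options. The helper names are hypothetical and the sample line is made up in the format that query prints. `read_gpu_stats` is defined but not called here, since it needs a working NVIDIA driver; calling it in a loop with a short sleep gives a sampled view of utilization while training runs.

```python
import shutil
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_stats(text):
    """Parse nvidia-smi CSV output into one dict per GPU."""
    stats = []
    for line in text.strip().splitlines():
        util, used, total = (field.strip() for field in line.split(","))
        stats.append({"util_percent": int(util),
                      "memory_used_mib": int(used),
                      "memory_total_mib": int(total)})
    return stats

def read_gpu_stats():
    """Run nvidia-smi once; return None if the tool is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)

# Made-up sample line in the format the query above prints.
sample = "3, 8817, 12189\n"
parsed = parse_gpu_stats(sample)
print(parsed)
```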

ooodragon · July 20, 2020, 12:08pm

Hello ptrblck,
thanks for your hard work!
I moved my ImageNet directory to an SSD and it is super fast…
One question, if I may:
my local computer (two Titan Xps) suffers less from the HDD speed bottleneck,
whereas my server with eight or twenty RTX 2080 Tis suffers a lot from the HDD…
Could there be a reason for this?

I also heard from coworkers that num_workers <= 8 is always more than enough, because adding more workers can introduce another bottleneck when distributing the loader’s work. Is this also true?

ptrblck · July 21, 2020, 4:26am

Your server might need to load much more data to feed the GPUs, so the data loading bottleneck might be more serious in this case. I would generally assume that an HDD could create a bottleneck even in a single-GPU setup.

I’m not sure you can define a hard threshold of e.g. 8 workers, as it depends on the system you are using, but your coworkers are correct: too many workers might degrade the performance again, so you would have to find the “sweet spot”.
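
One way to look for that sweet spot is to time a pass over the loader for several worker counts. This is a sketch under stated assumptions: `make_fake_loader` is a hypothetical stand-in that fakes a loader which gets faster up to 4 workers and slower beyond that. On a real system you would pass something like `lambda n: DataLoader(dataset, num_workers=n, ...)` instead, and the best value will differ.

```python
import time

def time_one_pass(loader, max_batches=50):
    """Seconds needed to pull up to max_batches batches from loader."""
    start = time.perf_counter()
    for i, _ in enumerate(loader):
        if i + 1 >= max_batches:
            break
    return time.perf_counter() - start

def sweep_workers(make_loader, candidates):
    """Time one pass per candidate worker count; return {count: seconds}."""
    return {n: time_one_pass(make_loader(n)) for n in candidates}

# Hypothetical stand-in for a DataLoader factory. The per-batch delay shrinks
# up to 4 workers, then grows again to mimic coordination overhead.
def make_fake_loader(num_workers):
    delay = 0.016 / min(max(num_workers, 1), 4) + 0.002 * max(num_workers - 4, 0)
    def batches():
        for i in range(10):
            time.sleep(delay)
            yield i
    return batches()

results = sweep_workers(make_fake_loader, [1, 2, 4, 8, 16])
best = min(results, key=results.get)
for n, seconds in results.items():
    print(f"num_workers={n:>2}: {seconds:.3f}s")
print("fastest setting:", best)
```

With a real DataLoader the first batches include worker startup cost, so timing more than a handful of batches gives a fairer comparison.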

sky518303 · February 10, 2022, 5:27am

Hi ptrblck, thank you for the good comments.
I wonder how much time is low enough.
I am doing an image regression task with 1024x512 images, resized to 400x200.
I get a data loading time of about 0.016 (0.115).
GPU utilization is almost 0%, rarely 40%, and I have already tried everything you mentioned in this thread (SSD, num_workers, pin_memory, etc.).
Any suggestions?

sky518303 · February 10, 2022, 6:34am

For anyone who uses TensorBoard’s writer.add_image… that was the problem.
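
The post does not say how often add_image was called. If it runs on every iteration, one common mitigation is to log images only every N steps. The sketch below assumes that setup; `FakeWriter` is a hypothetical stand-in for a TensorBoard SummaryWriter that only counts calls, and `log_images_every` is an illustrative parameter name.

```python
class FakeWriter:
    """Hypothetical stand-in for a TensorBoard SummaryWriter; counts calls only."""
    def __init__(self):
        self.image_calls = 0

    def add_image(self, tag, img_tensor, global_step=None):
        self.image_calls += 1

def train(num_steps, writer, log_images_every=100):
    for step in range(num_steps):
        image = [[0.0]]  # placeholder for a real image tensor
        # ... forward / backward / optimizer step would go here ...
        # Only log an image on every `log_images_every`-th step.
        if step % log_images_every == 0:
            writer.add_image("train/input", image, global_step=step)

writer = FakeWriter()
train(num_steps=510, writer=writer, log_images_every=100)
print(writer.image_calls)  # steps 0, 100, ..., 500 -> 6 calls
```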