CPU usage extremely high

Hello, I am running PyTorch and the CPU usage of a single process is exceeding 100%. It’s actually over 1000% and near 2000%. As a result, even though the number of workers is 5 and no other process is running, the CPU load average from ‘htop’ is over 20.

The main process is using over 2000% CPU while the data feeders (workers) are using around 100%.

I am using PyTorch 1.1 and CUDA 9.1.

Are there any other things that I have to check?

Try using OMP_NUM_THREADS=1 or torch.set_num_threads(1) to control CPU parallelization.
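
For reference, a minimal sketch of both options in one script. It assumes the environment variable is set before torch is imported, since the OpenMP runtime generally reads it at load time; the value 1 is just an example.

```python
import os

# Set before importing torch so the OpenMP runtime can pick it up.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("OMP_NUM_THREADS", "1")

import torch

# Limit intra-op CPU parallelism for this process.
torch.set_num_threads(1)

print("intra-op threads:", torch.get_num_threads())
```

The shell equivalent is to launch the script with OMP_NUM_THREADS=1 in front of the python command. Note that DataLoader workers are separate processes, so this mainly affects the main process.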

I tried this, but it had no effect.

from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4 * 4 * 50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4 * 4 * 50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

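def train(args, model, device, train_loader, optimizer, epoch):
    model.train()  # enable training mode
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()  # clear gradients from the previous step
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()  # backpropagate
        optimizer.step()  # update the weights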
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                       100. * batch_idx / len(train_loader), loss.item()))

def test(args, model, device, test_loader):
    model.eval()  # switch to evaluation mode
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target,
                                    reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1,
                                 keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=10, metavar='N',
                        help='number of epochs to train (default: 10)')
    parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                        help='learning rate (default: 0.01)')
    parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
                        help='SGD momentum (default: 0.5)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')

    parser.add_argument('--save-model', action='store_true', default=False,
                        help='For Saving the current Model')
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()

    torch.manual_seed(args.seed)  # make use of the --seed argument

    device = torch.device("cuda:0" if use_cuda else "cpu")

    kwargs = {'num_workers': 5, 'pin_memory': True} if use_cuda else {}
    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.batch_size, shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=False, transform=transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))
        ])),
        batch_size=args.test_batch_size, shuffle=True, **kwargs)

    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(args, model, device, test_loader)

    # if (args.save_model):
    #     torch.save(model.state_dict(), "mnist_cnn.pt")

if __name__ == '__main__':
    main()

I also tried this small script, and the problem persists!

@ggaemo your example looks simple enough, and I don’t see any immediate issues.

Something is really wrong, though, given that the main process is pinning all of your cores in kernel time (red in htop). Something is not happy at the driver/IRQ/IO level. Check that your CUDA and GPU driver versions match and are compatible with your PyTorch build; maybe update all of them to the latest to be sure. Make sure the drives where your dataset is located are working properly (try copying the dataset to another location and confirm that it completes quickly with no issues). Try compiling and running some of the CUDA sample benchmark apps (the bandwidth tests in utilities, perhaps).
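
As a starting point for the version check, here is a rough sketch that prints what PyTorch itself reports. The helper name describe_cuda_setup is made up for this post; the output still has to be compared by hand against the driver version shown by nvidia-smi.

```python
import torch


def describe_cuda_setup():
    """Collect the CUDA-related facts that PyTorch reports about itself."""
    info = {
        "torch": torch.__version__,
        # CUDA version PyTorch was built against (None on CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "devices": [],
    }
    if info["cuda_available"]:
        info["devices"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return info


for key, value in describe_cuda_setup().items():
    print("{}: {}".format(key, value))
```

If cuda_available is False on a machine with a GPU, or built_with_cuda is newer than what the installed driver supports, that mismatch is worth fixing before looking any further.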

@ggaemo Coincidentally, I just ran into this issue. I picked up some old code for a new project and gave it a spin to see what state it was in. I looked at htop and saw a scene similar to yours: all of my cores were at 100%, not quite as red as yours but 40-50% in the kernel. I think perhaps the more cores the system has, the more contention. This was an 8-core, 16-thread system; yours clearly has more. The odd thing is that the code was only using 2 worker processes, so this much activity makes little sense unless there is some serious spinning in the kernel.

The solution, set pin_memory=False! Problem solved (for me). Most of my newer code has that off by default as it’s caused me nothing but issues in the past.
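
For anyone wanting to try the same change, a minimal sketch of where the flag goes. The make_loader helper and the synthetic MNIST-shaped tensors are stand-ins made up for this post, and num_workers=0 is used only to keep the snippet self-contained; in the MNIST example above the equivalent change is in the kwargs dict.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, batch_size=64, num_workers=0, pin_memory=False):
    """Build a DataLoader with pinned memory switched off by default."""
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=num_workers, pin_memory=pin_memory)


# Synthetic stand-in for MNIST: 256 grayscale 28x28 images with labels 0-9.
dataset = TensorDataset(torch.randn(256, 1, 28, 28),
                        torch.randint(0, 10, (256,)))

loader = make_loader(dataset)
data, target = next(iter(loader))
print(data.shape, target.shape, loader.pin_memory)
```

Passing pin_memory=True to the same helper gives the other side of the comparison, so the two htop pictures can be checked with nothing else changed.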

After seeing another thread on here today that used the MNIST example as its starting point, I noticed I can reproduce it with the MNIST example. Can you guess which of these screenshots was with pin_memory=True? No other differences… :)


I ran into this problem too. I think it may be caused by mismatched CUDA libraries. The problem went away when I ran my code in a new environment with a different version of the CUDA toolkit, installed along with PyTorch.

I ran into this problem as well. It was solved by setting pin_memory=False.
Interestingly, my PC should have more than enough VRAM/RAM to work with. I don’t know why it is so laggy when pin_memory=True.
The environment is listed as follows:



64GB DDR5 5200MHz

  • CUDA:
      • GPU:
          • NVIDIA GeForce RTX 3090
          • NVIDIA GeForce RTX 3090
      • available: True
      • version: 11.2
  • Packages:
      • numpy: 1.21.5
      • pyTorch_debug: False
      • pyTorch_version: 1.10.0
      • pytorch-lightning: 1.5.8
      • tqdm: 4.62.3
  • System:
      • OS: Linux
      • architecture:
          • 64bit
          • ELF
      • processor: x86_64
      • python: 3.9.7
      • version: #202201071026-Ubuntu SMP Fri Jan 7 16:52:09 UTC 2022

I just hit what could be caused by this. What did you set pin_memory=False on? I do not explicitly set it to True anywhere…
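
One way to check, sketched under the assumption that you can get hold of the DataLoader objects your code ends up using: a plain torch.utils.data.DataLoader defaults to pin_memory=False and keeps the value it was built with in its pin_memory attribute. The report_pin_memory helper and the loader names below are made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def report_pin_memory(loaders):
    """Map each loader's name to the pin_memory value it was built with."""
    return {name: loader.pin_memory for name, loader in loaders.items()}


dataset = TensorDataset(torch.zeros(16, 3), torch.zeros(16))
loaders = {
    "default": DataLoader(dataset, batch_size=4),
    "pinned": DataLoader(dataset, batch_size=4, pin_memory=True),
}
print(report_pin_memory(loaders))
```

If every loader you construct yourself reports False, the flag may be getting set by a wrapper or a config file somewhere in your stack, so that would be the next place to look.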