CPU usage extremely high

ggaemo · July 31, 2019, 10:00am

Hello, I am running PyTorch and the CPU usage of a single process is exceeding 100%. It's actually over 1000% and close to 2000%. As a result, even though the number of workers is 5 and no other process is running, the CPU load average shown by 'htop' is over 20.

The main process is using over 2000% CPU, while the data feeders (workers) are each using around 100%.

I am using pytorch 1.1 and cuda 9.1.

Are there any other things that I have to check?

cyanM · August 1, 2019, 7:55am

Try setting OMP_NUM_THREADS=1 or calling torch.set_num_threads(1) to limit CPU parallelism.
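
For anyone trying this suggestion, here is a minimal sketch of both options combined. The helper name limit_cpu_threads is made up for illustration, and setting MKL_NUM_THREADS as well is an assumption on my part (it only matters for MKL-backed builds). The environment variables are normally read when the threading runtime starts, so they should be set before torch is first imported.

```python
import os


def limit_cpu_threads(n=1):
    """Cap CPU threads used for intra-op parallelism (hypothetical helper)."""
    # Set these before torch is first imported; changing them later
    # usually has no effect on an already-started OpenMP/MKL runtime.
    os.environ["OMP_NUM_THREADS"] = str(n)
    os.environ["MKL_NUM_THREADS"] = str(n)  # assumption: MKL-backed build
    try:
        import torch
    except ImportError:
        # Lets the sketch run on machines without PyTorch installed.
        return None
    # Explicit runtime cap; applies even if torch was imported earlier.
    torch.set_num_threads(n)
    return torch.get_num_threads()


if __name__ == "__main__":
    print(limit_cpu_threads(1))
```

The same cap can be applied from the shell by prefixing the launch command with OMP_NUM_THREADS=1. Note that this only limits threads inside one process; each DataLoader worker is a separate process.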

ggaemo · August 18, 2019, 3:58am

Tried this but did not work. It didn’t have any effect.

from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4 * 4 * 50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4 * 4 * 50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                       100. * batch_idx / len(train_loader), loss.item()))


def test(args, model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target,
                                    reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1,
                                 keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))


def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=10, metavar='N',
                        help='number of epochs to train (default: 10)')
    parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                        help='learning rate (default: 0.01)')
    parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
                        help='SGD momentum (default: 0.5)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')

    parser.add_argument('--save-model', action='store_true', default=False,
                        help='For Saving the current Model')
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()

    torch.manual_seed(args.seed)

    device = torch.device("cuda:0" if use_cuda else "cpu")

    kwargs = {'num_workers': 5, 'pin_memory': True} if use_cuda else {}
    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.batch_size, shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=False, transform=transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))
        ])),
        batch_size=args.test_batch_size, shuffle=True, **kwargs)

    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(args, model, device, test_loader)

    # if (args.save_model):
    #     torch.save(model.state_dict(), "mnist_cnn.pt")


if __name__ == '__main__':
    main()

I also tried this small script and the problem persists!

rwightman · August 18, 2019, 4:58am

@ggaemo your example looks simple enough, and I don't see any immediate issues.

Something is really wrong though, based on the fact that the main process is pinning all of your cores in the kernel (red). Something is not happy at a driver/IRQ/IO level. Check that your CUDA and GPU driver versions match and are compatible with your PyTorch build; maybe update all of them to the latest to be sure. Make sure the drives where your dataset is located are working properly (try copying your dataset to another location and confirm that it happens quickly with no issues). Try compiling and running some of the CUDA sample benchmark apps (the bandwidth tests in utilities, perhaps).
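
As a starting point for the version check, a small sketch that prints what the installed PyTorch build reports. The function name cuda_report is hypothetical; the torch attributes it reads are standard. It does not show the GPU driver version, which still has to be read separately (for example from nvidia-smi).

```python
def cuda_report():
    """Collect the CUDA-related facts the installed PyTorch build reports."""
    try:
        import torch
    except ImportError:
        # Lets the sketch run on machines without PyTorch installed.
        return {"torch": None}
    return {
        "torch": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```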

rwightman · August 21, 2019, 6:46pm

@ggaemo Coincidentally, I just ran into this issue. I picked up some old code for a new project and gave it a spin to see what state it was in. I looked at htop and saw a scene similar to yours: ALL of my cores were at 100%, not quite as red as yours, but 40-50% in the kernel. I think perhaps the more cores the system has, the more contention there is. This was an 8-core, 16-thread system; yours clearly has more. The odd thing is that the code was only using 2 worker processes, so this much activity makes little sense unless there is some serious spinning in the kernel.

The solution, set pin_memory=False! Problem solved (for me). Most of my newer code has that off by default as it’s caused me nothing but issues in the past.

After seeing another thread on here today using the MNIST example as their starting point, I noticed I can reproduce it with the MNIST example. Can you guess which of these screenshots was taken with pin_memory=True? No other differences…
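
For reference, in the MNIST script posted above the setting lives in the kwargs dict that is unpacked into both DataLoader calls. A minimal sketch of the change, written as a small helper so the flag is easy to flip when comparing runs (loader_kwargs is a hypothetical name, not part of the original example):

```python
def loader_kwargs(use_cuda, num_workers=5, pin_memory=False):
    """Build the DataLoader keyword arguments used by the MNIST script.

    Mirrors the original `kwargs` line, but with pin_memory off by default.
    """
    if not use_cuda:
        # The original script passes no extra arguments on CPU.
        return {}
    return {"num_workers": num_workers, "pin_memory": pin_memory}


# In main(), the original line
#     kwargs = {'num_workers': 5, 'pin_memory': True} if use_cuda else {}
# would become
#     kwargs = loader_kwargs(use_cuda)
print(loader_kwargs(True))
```

Whether this helps seems to depend on the system; several posters in this thread report that it did, but it is a workaround rather than an explanation of the kernel-time spike.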

Rongjie_Li · April 30, 2020, 7:26am

I ran into this problem too. I think it may be caused by mismatched CUDA libraries. The problem went away when I ran my code in a new environment with a different version of the CUDA toolkit, installed together with PyTorch.

allanchan339 · January 27, 2022, 3:58am

I ran into this problem as well, and setting pin_memory=False solved it.
Interestingly, my PC should have more than enough VRAM and RAM to work with, so I don't know why it is so laggy with pin_memory=True.
My environment is listed below:

Environment

CPU: i9-12900K
RAM: 64GB DDR5 5200MHz

CUDA:
    GPU:
        NVIDIA GeForce RTX 3090
        NVIDIA GeForce RTX 3090
    available: True
    version: 11.2

Packages:
    numpy: 1.21.5
    pyTorch_debug: False
    pyTorch_version: 1.10.0
    pytorch-lightning: 1.5.8
    tqdm: 4.62.3

System:
    OS: Linux
    architecture:
        64bit
        ELF
    processor: x86_64
    python: 3.9.7
    version: #202201071026-Ubuntu SMP Fri Jan 7 16:52:09 UTC 2022

lostmsu · February 14, 2023, 6:51pm

I just hit what could be caused by this. What did you set pin_memory=False on? I do not explicitly set it to True anywhere…
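
For context on this question: pin_memory is a constructor argument of torch.utils.data.DataLoader and defaults to False, so if it is on, some wrapper, config, or copied example code is passing it. A DataLoader keeps the value as its pin_memory attribute, which makes it possible to check existing loaders. The sketch below uses a hypothetical helper, pinned_loaders, and stand-in objects in place of real loaders so that it runs without a dataset.

```python
from types import SimpleNamespace


def pinned_loaders(named_loaders):
    """Return the names of loaders whose pin_memory attribute is truthy."""
    return [
        name
        for name, loader in named_loaders.items()
        if getattr(loader, "pin_memory", False)
    ]


# Stand-ins for real DataLoader objects (assumption: illustration only).
loaders = {
    "train": SimpleNamespace(pin_memory=True),
    "val": SimpleNamespace(pin_memory=False),
}
print(pinned_loaders(loaders))
```

With real code, pass a dict of the actual DataLoader instances instead of the stand-ins.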