Pytorch v2 High CPU consumption

namanmehta1994 · July 9, 2024, 9:14am

Generally, I work with Pytorch v1, recently, I decided to make an upgrade to Pytorch v2.
However, I notice that that the cpu consumption is really high. It’s almost more than 70%.
My task is image classification using resnet/mobilnet, and I am working with Flower102 Dataset(dummy data, just for reference)
I have gone through the resources such as the followings:

My System Specs:
Ubuntu 22.04.4 LTS
conda: 23.7.4
Python: 3.12.2
torch: 2.2.1
nvidia driver: 555.42.06(12.4 cuda)

Since pytorch v1 works seemlessly for me, I figured its driver/dependecy issue with pytorchv2, so I tried different variations.
But I have tried the sasme process with:

Python package manager: pip, conda, built from source
Python versions: 3.12.2, 3.10.14
torch versions: 2.2.1, 2.1.2
nvidia driver: 555.42.06(12.4 cuda, 12.5 cuda), 470.82.01(12.2 cuda, 12.4 cuda)
What I have tried:
batchsize(I have tried as low as batch size of 8 images)
pin_memory(in dataloader)
num_workers(in dataloader)
OMP_NUM_THREADS(I could not locate this in os environment params, upon checking some docs, this doesn’t really seem like it has any relevance here)

import os
from PIL import Image
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import torchmetrics
from net import Net
from tqdm import tqdm
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models import resnet101, ResNet101_Weights
from torchvision.models import resnet152, ResNet152_Weights
from torchvision.transforms import v2
from torchvision.transforms import ToTensor
import torchvision.datasets

def train(model, criterion, optimizer, train_loader, val_loader, max_epochs=3, target_accuracy=0.99):

    _device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
    start_epoch = 0

    for epoch in range(start_epoch, max_epochs):
        # Training
        model.train()
        with tqdm(train_loader) as pbar:
            pbar.set_description(f"[Train] Epoch {epoch}")
            for images, labels in pbar:
                images = images.to(_device)
                labels = labels.to(_device)
                optimizer.zero_grad()
                outputs = model(images)
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()
                loss = loss.cpu()
                pbar.set_postfix({"loss": loss.item()})

        # Validation
        model.eval()
        metric_acc = torchmetrics.Accuracy("multiclass", num_classes=200)
        with tqdm(val_loader) as pbar:
            pbar.set_description(f"[Valid] Epoch {epoch}")
            for images, labels in pbar:
                images = images.to(_device)
                pred = torch.argmax(model(images), axis=1).cpu()
                metric_acc(pred, labels)
                pbar.set_postfix({"acc": metric_acc.compute().item()})

if __name__ == "__main__":
    transforms_ = v2.Compose([
    v2.Resize(size=(300, 300)),
    ToTensor(),
    ])
    batch_size = 8
    dataset = torchvision.datasets.Flowers102(root="./data", split="train", transform=transforms_, download=True)
    n_samples = len(dataset)
    train_samples = int(n_samples * 0.95)
    val_samples = n_samples - train_samples
    train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_samples, val_samples])
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, 
      shuffle=True, pin_memory=False, num_workers=0)
    val_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=False)

    model=resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)    
    _device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model=model.to(_device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters())
    train(model, criterion, optimizer, train_loader, val_loader)

Any points/suggestions would be appreciated.

namanmehta1994 · July 9, 2024, 10:00am

I found the solution here

github.com/pytorch/pytorch

pytorch consuming all cpu cores 100% on ARM

opened 06:30PM - 06 Oct 23 UTC

sanchay-hai

module: binaries triaged module: openmp module: arm

### 🐛 Describe the bug We have a small-ish torch model and runs pretty fast and… as expected on Intel x64. Although, even on Intel, setting `torch.set_num_threads(1)` reduces cpu usage. But with ARM (16 core AWS graviton e.g.) it starts consuming 1600% cpu! Then setting `torch.set_num_threads(1)` tones it a bit down to around `1500%+`. Then setting `export OMP_NUM_THREADS=1` brings it down all the way to 25% cpu. This seems like totally unexpected behavior. ### Versions /usr/local/lib/python3.11/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:84.) device: torch.device = torch.device(torch._C._get_default_device()), # torch.device('cpu'), Collecting environment information... PyTorch version: 2.1.0 Is debug build: False CUDA used to build PyTorch: None ROCM used to build PyTorch: N/A OS: Debian GNU/Linux 12 (bookworm) (aarch64) GCC version: (Debian 12.2.0-14) 12.2.0 Clang version: Could not collect CMake version: Could not collect Libc version: glibc-2.36 Python version: 3.11.4 (main, Aug 16 2023, 07:34:21) [GCC 12.2.0] (64-bit runtime) Python platform: Linux-5.19.0-1025-aws-aarch64-with-glibc2.36 Is CUDA available: False CUDA runtime version: No CUDA CUDA_MODULE_LOADING set to: N/A GPU models and configuration: No CUDA Nvidia driver version: No CUDA cuDNN version: No CUDA HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Architecture: aarch64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V1 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping: r1p1 BogoMIPS: 2100.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache: 16 MiB (16 instances) L3 cache: 32 MiB (1 instance) NUMA node(s): 1 NUMA node0 CPU(s): 0-15 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; __user pointer sanitization Vulnerability Spectre v2: Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Versions of relevant libraries: [pip3] torch==2.1.0 [pip3] torchaudio==2.1.0 [conda] Could not collect cc @seemethere @malfet @osalpekar @atalman @snadampal

torch.set_num_threads(n)
This reduces cpu consumption significantly, however, this does not happen in Pytorch v1, so I guess Pytorch developers made change in implementation for v2 and left it to the user to decide number of threads torch would use.
I should have looked for this in the docs, but I guess I missed that!