Weird profiler stack trace : Not sure if GPU is enabled on a local machine

I am currently working on profiling, learning about torch profiler and tensorboard using it. I am using this tutorial : PyTorch Profiler With TensorBoard — PyTorch Tutorials 2.2.0+cu121 documentation.

The thing is that I tried it using google colab & my own local computer that has a RTX2080. When I run the exact tutorial code with colab I am obtaining a similar report, telling me about GPU stuff…
Yet when I run on my machine (after a call of .cuda() I checked that my models & data were on cuda and according to pytorch they are) I am obtaining the following result :

This seems to indicate that all the operation performed were performed on CPU (and the device is CPU), while my device is cuda.

I believe that it might be because I think I did not install cudatoolkit on my machine (but I did install torch using pip), but can this line :

device = cuda if torch.cuda.is_available() else cpu

return cuda if cudatoolkit is not installed on my machine ? Is it possible to put models & data on cuda and still perform operation on cpu ?

And if it is and if I am performing operation on cpu : I noticed that it is still way quicker to have model & data on cuda than on cpu, how can it be quicker ?

Thank you for your time and help

Yes, since PyTorch binaries ship with their own CUDA dependencies and you thus won’t need to install a local CUDA toolkit.

If the model and data were moved to the GPU their operations will be performed on the GPU. Other tensors (on the CPU) can still perform operations on the CPU.

If you moved the model and data to the GPU all operations will be performed on the GPU.

To your profiler question: I’m not familiar enough with the built-in profiler and use Nsight Systems instead, so you might also give it a try.

1 Like

Thank you, I will have a look and make a profiling using it, we’ll see if this one understand that operations are done on GPU.

This answer helped me to get GPU recognized by the profiler with a side effect.

This command gets the “Distributed” view to appear, but the GPU is not recognized.

srun nsys profile --gpu-metrics-device=0 --sample=none --cpuctxsw=none --force-overwrite true --nic-metrics=true --trace=cuda,nvtx,cudnn --output=logs/${EXP_NAME}/nsys_logs_%h torchrun \
--nnodes $NUM_NODES \
--nproc_per_node 1 \
--rdzv_id $RANDOM \
--rdzv_backend c10d \
--rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \ --epochs=10

While this command gets the “Distributed” view to diappear, and the GPU to be recognized.

torchrun \
--nnodes $NUM_NODES \
--nproc_per_node 1 \
--rdzv_id $RANDOM \
--rdzv_backend c10d \
--rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \ --epochs=10

My PyTorch Code

import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from torch.cuda.amp import GradScaler, autocast
from torch.nn.parallel import DistributedDataParallel
from torch.utils import data
from torchvision import models

precisions = {
    "fp16": torch.float16,
    "fp32": torch.float32,

def train(train_sampler, epoch, device, train_loader, model, loss_criterion, optimizer, profiler, scaler, precision):
    for batch_idx, (inputs, targets) in enumerate(train_loader):
        inputs, targets = inputs.cuda(device), targets.cuda(device)

        with autocast(dtype=precision):
            outputs = model(inputs)
            loss = loss_criterion(outputs, targets)


        if batch_idx % 10 == 0:
            current_samples = batch_idx * len(inputs)
            total_samples = len(train_sampler)
            progress = 100.0 * current_samples / total_samples
            print(f'Epoch: {epoch} [{current_samples}/{total_samples} ({progress:.0f}%)]\tLoss: {loss.item():.6f}')

def validate(epoch, device, validation_loader, model, loss_criterion):
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for inputs, targets in validation_loader:
            inputs, targets = inputs.cuda(device), targets.cuda(device)
            outputs = model(inputs)
            test_loss += loss_criterion(outputs, targets).item()
            _, predicted = outputs.max(1)
            correct += predicted.eq(targets).sum().item()

    test_loss /= len(validation_loader.dataset)
    accuracy = 100. * correct / len(validation_loader.dataset)
    print(f'\nValidation. Epoch: {epoch}, Loss: {test_loss:.4f}, Accuracy: ({accuracy:.2f}%)\n')

def cleanup():

def get_data(rank, world_size, args):
    transform = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))

    train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
    train_sampler = data.distributed.DistributedSampler(train_set, num_replicas=world_size, rank=rank)
    train_loader = data.DataLoader(train_set, sampler=train_sampler, batch_size=args.batch_size, num_workers=1,

    validation_set = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
    validation_loader = data.DataLoader(validation_set, batch_size=args.batch_size, num_workers=1,

    return train_sampler, train_loader, validation_loader

def main(args, rank, world_size):
    train_sampler, train_loader, validation_loader = get_data(rank, world_size, args)

    device = torch.device("cuda:0")
    print(f"Rank {rank}: Using CUDA device {device}")

    model = models.resnet18(weights=None, num_classes=10)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    model = model.cuda(device)

    ddp_model = DistributedDataParallel(model)
    scaler = GradScaler()

    loss_criterion = nn.CrossEntropyLoss().cuda(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)


    with torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
            schedule=torch.profiler.schedule(wait=1, warmup=1, active=10, repeat=1),
    ) as profiler:
        for epoch in range(args.epochs):
            train(train_sampler, epoch, device, train_loader, ddp_model, loss_criterion, optimizer, profiler, scaler,
            validate(epoch, device, validation_loader, ddp_model, loss_criterion)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=512)
    parser.add_argument("--model", type=str, default="resnet18")
    parser.add_argument("--precision", type=str, default="fp32")
    parser.add_argument("--exp-name", type=str, default="test")
    args = parser.parse_args()

    # distributed
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
    rank = torch.distributed.get_rank()
    world_size = torch.distributed.get_world_size()

    main(args, rank, world_size)

Check why your slurm environment seems to have issues with your GPU or use Nsight Systems directly on torchrun.

Like what issues?