PyTorch Profiler not profiling GPU on Windows

Hello everyone,

I’m new here, so hopefully I’m posting this the right way.

I’ve recently started using PyTorch’s profiler, but it doesn’t show any activity on my GPU. I’m currently running the example from this guide. The code runs without errors. Task Manager shows GPU activity while the code runs (specifically on the CUDA graph, I did my homework), so the problem is clearly neither PyTorch nor CUDA itself.

Here is the code I use to create the model and start the profiler:

import torch
import torch.profiler
import torchvision
import torchvision.transforms as T


def main():
    transform = T.Compose(
        [T.Resize(224),
         T.ToTensor(),
         T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)

    device = torch.device("cuda:0")
    model = torchvision.models.resnet18(pretrained=True).to(device)
    criterion = torch.nn.CrossEntropyLoss().to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    model.train()

    def train(data):
        inputs, labels = data[0].to(device), data[1].to(device)
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    with torch.profiler.profile(
            schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
            on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
            record_shapes=True,
            profile_memory=True,
            with_stack=True
    ) as prof:
        for step, batch_data in enumerate(train_loader):
            if step >= 200:  # tried increasing this because maybe it wasn't running long enough?
                break
            train(batch_data)
            prof.step()  # Need to call this at the end of each step to notify profiler of steps' boundary.


if __name__ == '__main__':
    main()

And yet when I open TensorBoard at the appropriate location I see only this:


And the GPU Kernel element is completely missing from the side bar as well.

I don’t understand what I am doing wrong in this case. Is this just a Windows problem or am I using the profiler incorrectly?

Thank you for your help!
ChowderII


Hi ChowderII, have you found a solution? I’m using Windows too and have the same problem. Just CPU and Other, no GPU.

Same problem here. Am also on Windows and can’t see the GPU summary.


Hello

I am using Windows 10, and I can’t see any GPU Kernel option in the TensorBoard Views list.

Hi, I am experiencing the very same problem on Windows 10 and my model is definitely running on the GPU.

Any clue?

I have the same problem. Is anyone looking into this, or has anyone solved it?

Same issue here. Any idea on this @ptrblck? Your profile seems to pop up in all the relevant forums I’ve seen throughout the years, so here’s hoping you know what’s going on.

Sorry, but I’m neither familiar with Windows nor with the native PyTorch profiler and am using Nsight Systems to profile models.
You could try to use it too: Getting Started with Nsight Systems | NVIDIA Developer


I’ll check it out, thank you!

Has anyone solved this problem or worked out any clues? I have encountered the same issue.

Maybe I’m late, but has anyone solved this problem?
my environment:
cuda11.8
win10
i512500k
rtx4070ti

Was anyone able to resolve this? I’m facing the same issue.

Hi, all
CUPTI seems to have issues on the Windows platform, so it is disabled when building on Windows. That is why you don’t get any GPU op info in the trace. It looks like Kineto supports Windows, though?
Thank you.
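One way to check this on your own install is to ask PyTorch which profiler activities it reports as supported. The snippet below is only a diagnostic sketch, assuming a reasonably recent PyTorch that exposes torch.profiler.supported_activities() and torch.autograd.kineto_available(); the helper name profiler_diagnostics is made up for illustration.

```python
import importlib.util


def profiler_diagnostics():
    """Collect facts relevant to missing GPU events in torch.profiler output.

    Hypothetical helper for illustration; it is not part of PyTorch.
    """
    info = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not info["torch_installed"]:
        return info

    import torch
    from torch.profiler import ProfilerActivity, supported_activities

    info["torch_version"] = torch.__version__
    # CUDA version this PyTorch build was compiled against (None for CPU-only builds).
    info["built_with_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    # Whether this build includes the Kineto tracing backend.
    info["kineto_available"] = torch.autograd.kineto_available()
    # Whether the profiler reports CUDA as an activity it can record.
    info["cuda_activity_supported"] = ProfilerActivity.CUDA in supported_activities()
    return info


if __name__ == "__main__":
    for key, value in profiler_diagnostics().items():
        print(f"{key}: {value}")
```

If cuda_activity_supported comes back False while cuda_available is True, that would be consistent with the profiler build lacking GPU tracing rather than with a mistake in the training script.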

I also have the same problem.

As someone who is able to profile GPU usage on Windows, I’d be happy to help troubleshoot 🙂

I recently posted some GPU profiling code that worked for me: Model() uses GPU but backwards() doesn't - #3 by neoncube

It looks like your code isn’t passing activities=[profiler.ProfilerActivity.CPU, profiler.ProfilerActivity.CUDA] to torch.profiler.profile(), which I thought was required.

Also, for me, passing both activities=[profiler.ProfilerActivity.CPU, profiler.ProfilerActivity.CUDA] and with_stack=True crashed the process with no error, so you might want to try removing record_shapes, profile_memory, and with_stack.

I’d also be curious to see whether calling prof.key_averages().table(sort_by='cuda_time_total', row_limit=10) or prof.key_averages().table(sort_by='cpu_time_total', row_limit=10) works, and whether this is just an issue with exporting to TensorBoard specifically.


I’m currently using Windows 10 and torch 1.13.1+cu117 on an ROG Zephyrus G14, and can confirm that print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=10)) shows Self CUDA time total: 749.103ms for the following script:

import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(4000, 4000, device='cuda')
y = torch.randn(4000, 4000, device='cuda')

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]
) as prof:
    z = torch.matmul(x, y)
    torch.cuda.synchronize()  # ensure kernel finishes

prof.export_chrome_trace("cuda_matmul_trace.json")

print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=10))

However, when I load the output .json file in chrome://trace, there are no GPU events. Additionally, I uploaded the JSON file to ChatGPT, which confirmed that the trace contains no GPU events.

Thus, this does seem to be an issue in the export, rather than the tracking.
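The exported trace can also be checked locally by counting events per category. This is a minimal sketch, assuming the file follows the Chrome trace event layout (a top-level "traceEvents" list, or a bare list of events) and that GPU-side events carry categories such as "kernel", "gpu_memcpy" or "gpu_memset". Those category names are an assumption and may differ between PyTorch versions; the function names are hypothetical.

```python
import json
from collections import Counter

# Assumption: category names used for GPU-side events in profiler traces.
# Older traces may use capitalized names such as "Kernel" or "Memcpy",
# so categories are compared in lowercase.
GPU_CATEGORIES = {"kernel", "gpu_memcpy", "gpu_memset", "memcpy", "memset"}


def load_trace(path):
    """Read a trace file written by prof.export_chrome_trace()."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_trace_categories(trace):
    """Count complete ("ph" == "X") events per lowercase category."""
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    return Counter(
        str(event.get("cat", "")).lower()
        for event in events
        if isinstance(event, dict) and event.get("ph") == "X"
    )


def gpu_event_count(trace):
    """Return how many events fall into the assumed GPU categories."""
    counts = count_trace_categories(trace)
    return sum(n for cat, n in counts.items() if cat in GPU_CATEGORIES)
```

For the script above, count_trace_categories(load_trace("cuda_matmul_trace.json")) lists what the export actually contains, and a gpu_event_count of 0 would match what chrome://trace shows.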

For me, a restart helped 🙈

In my case, the issue was resolved after ensuring that the CUDA version my PyTorch build targets matches the CUDA version reported for my device. You can check your PyTorch version using pip show torch and the CUDA version using nvidia-smi (which reports the highest CUDA version the installed driver supports). For example, if your PyTorch version is 2.6.0+cu128, your CUDA version should be 12.8.
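To make that comparison less error-prone, the CUDA version can be read off the PyTorch version string. The helper below is a small sketch; its name is hypothetical, and it assumes the "+cuXYZ" suffix convention seen in versions like 2.6.0+cu128 and 1.13.1+cu117, where the last digit is the minor version. With PyTorch installed, torch.version.cuda gives the same information directly.

```python
import re


def cuda_from_torch_version(version):
    """Return e.g. '12.8' for '2.6.0+cu128', or None if there is no +cuXYZ tag."""
    match = re.search(r"\+cu(\d{2,})$", version)
    if match is None:
        return None
    digits = match.group(1)
    # Assumption: the last digit is the CUDA minor version, the rest is the major.
    return f"{digits[:-1]}.{digits[-1]}"
```

The result can then be compared by eye with the "CUDA Version" field that nvidia-smi prints.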