Does the PyTorch profiler serialize computations on CUDA?

I’ve been trying to profile some of my code and noticed that it seemed awfully serial. After a lot of digging, I narrowed it down to a single example that I think is relevant:

import torch
import time

from torch.profiler import profile, record_function, ProfilerActivity

if __name__ == "__main__":

    N = 1000
    mul = 50

    profile_it = True

    # one CUDA stream per GPU
    streams = [torch.cuda.Stream() for _ in range(8)]

    # warm up each device with a small allocation
    for i in range(8):
        with torch.cuda.stream(streams[i]):
            torch.rand(10, 10, device='cuda:{}'.format(i))

    print(N)
    q2 = []
    qs = []

    # allocate one large matrix per GPU
    for i in range(8):
        with torch.cuda.stream(streams[i]):
            temp = torch.rand(mul * N, mul * N, device='cuda:{}'.format(i))
            qs.append(temp)

    # wait for the allocations to finish
    for i in range(8):
        temp = 'cuda:{}'.format(i)
        torch.cuda.synchronize(torch.device(temp))
        print(temp)

    # run the chained matmuls on every GPU, with or without the profiler
    if profile_it:
        with profile(activities=[ProfilerActivity.CUDA]) as prof2:
            for i, q in enumerate(qs):
                with torch.cuda.stream(streams[i]):
                    q2.append(q @ q @ q @ q)
    else:
        for i, q in enumerate(qs):
            with torch.cuda.stream(streams[i]):
                q2.append(q @ q @ q @ q)

    # wait for the matmuls to finish
    for i in range(8):
        temp = 'cuda:{}'.format(i)
        torch.cuda.synchronize(torch.device(temp))
        print(temp)

    if profile_it:
        prof2.export_chrome_trace('check.json')

When I keep the profiler off, this is what my GPU usage looks like:

When I turn it on, here’s what it looks like:

Under the Nsight profiler, the same run looks like this:

and the corresponding profile trace looks right:

Is this expected behavior? I couldn’t figure out from the docstrings whether I should expect this.
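For reference, a minimal sketch of one way to capture such a run with Nsight Systems; the nsys flags, the capture-range setting, and script.py below are illustrative rather than the exact setup used for the screenshots above:

# one possible invocation (flags are illustrative):
#   nsys profile -t cuda,nvtx --capture-range=cudaProfilerApi -o check_nsys python script.py

import torch

x = torch.rand(2000, 2000, device='cuda')
torch.cuda.synchronize()                    # keep context creation out of the capture

torch.cuda.profiler.start()                 # opens the cudaProfilerApi capture range
with torch.autograd.profiler.emit_nvtx():   # annotate ops with NVTX ranges for Nsight
    y = x @ x @ x @ x
torch.cuda.profiler.stop()
torch.cuda.synchronize()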

I wouldn’t expect to see serialization of workloads with the native PyTorch profiler, but I’m also using Nsight Systems for my profiling.
Could you export the trace from the native profiler and also visualize it, as it should likewise show the overlaps (or the lack of them, in case something weird is indeed going on)?
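Something along these lines is what I have in mind — a minimal, single-GPU sketch; the sort key and file name are just examples, and the exported JSON can be opened in chrome://tracing or ui.perfetto.dev:

import torch
from torch.profiler import profile, ProfilerActivity

x = torch.rand(2000, 2000, device='cuda')
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = x @ x @ x @ x
torch.cuda.synchronize()

# aggregated view on the console
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# timeline view: open the exported JSON in chrome://tracing or ui.perfetto.dev
prof.export_chrome_trace("trace.json")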

Sorry @ptrblck, I exported it but forgot to attach it. Attaching a screenshot, which corresponds to what gpustat -i shows.

This might be related to 176120 as well

Thanks for sharing the outputs. I’ll try to reproduce this later on one of our systems, as I don’t know how to properly interpret the native PyTorch profiler output since I rarely use its visualization.
