Does Dynamo trigger real kernel execution?

I think the answer is NO according to “Dynamo Overview — PyTorch 2.5 documentation”:

Dynamo hooks into the frame evaluation API in CPython (PEP 523) to dynamically modify Python bytecode right before it is executed.
It creates this FX Graph through bytecode analysis

Based on the statement above, I think Dynamo does the Python bytecode analysis without executing the real kernels, in order to generate a computation graph in Torch IR.

But I tried the code below, and the result does not match my understanding. It looks like Dynamo triggers real kernel execution on the device.

Since I only care about Dynamo, I chose the "eager" backend below so that AOTAutograd is not involved.
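For reference, my understanding is that a Dynamo backend is just a callable that receives the captured torch.fx.GraphModule plus example inputs and returns something to run in its place; the built-in "eager" backend simply returns gm.forward. Here is a minimal sketch of a custom backend that prints whatever Dynamo captured (the printing_backend name and the toy function f are just for illustration):

import torch
from typing import List

def printing_backend(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
    # Dynamo calls this once per captured (sub)graph.
    gm.print_readable()   # print the Python code of the captured FX graph
    return gm.forward     # run the graph as-is, which is what backend="eager" does

@torch.compile(backend=printing_backend)
def f(x):
    return torch.cos(torch.sin(x)) * 2

f(torch.ones(3))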

Case 1:

import torch

class TestNet(torch.nn.Module):
    def __init__(self):
        super(TestNet, self).__init__()
        self.linear = torch.nn.Linear(3, 2)

    def forward(self, x):
        x = x * 2
        if x.sum() > 0:
            x = torch.sin(x)
        if x.mean() > 0:
            x = torch.cos(x)
        x = self.linear(x)
        return x

m = TestNet().cuda()
m = torch.compile(m, backend="eager")
inputs = torch.ones(3).cuda()
m(inputs)

With TORCH_LOGS="+dynamo" python -u eagerbackend.py, we can see from the log that sin and cos are in the captured computation graphs.
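(If it helps, the same logging can also be enabled from Python instead of the environment variable; I believe recent PyTorch versions expose torch._logging.set_logs for this, roughly as in the sketch below.)

import logging
import torch

# Roughly equivalent to TORCH_LOGS="+dynamo,graph_code": verbose Dynamo logs
# plus the generated Python code for each captured FX graph.
torch._logging.set_logs(dynamo=logging.DEBUG, graph_code=True)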

Case 2:

import torch

class TestNet(torch.nn.Module):
    def __init__(self):
        super(TestNet, self).__init__()
        self.linear = torch.nn.Linear(3, 2)

    def forward(self, x):
        x = x * 2
        if x.sum() > 0:
            x = torch.sin(x)
        if x.mean() > 0:
            x = torch.cos(x)
        x = self.linear(x)
        return x

m = TestNet().cuda()
m = torch.compile(m, backend="eager")
inputs = torch.ones(3).cuda() * -1.0
m(inputs)

With TORCH_LOGS="+dynamo" python -u eagerbackend.py, we do not see anything about sin and cos in the log.

So, putting the two cases together, case 1 implies that the code enters the if bodies, and case 2 implies that it does not. That is correct behavior, BUT we can only know which branch is taken by actually executing the kernels, since that is the only way to know the values of x.sum() and x.mean().
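One way I can compare what Dynamo captured for the two inputs, without digging through the raw logs, is torch._dynamo.explain, which (if I understand the 2.x API correctly) reports the captured graphs and any graph breaks:

import torch

model = TestNet().cuda()              # the uncompiled module from the snippets above

for value in (1.0, -1.0):             # case 1 and case 2 inputs
    torch._dynamo.reset()             # clear previously compiled code between runs
    inputs = torch.ones(3).cuda() * value
    explanation = torch._dynamo.explain(model)(inputs)
    print(f"input sign {value}:")
    print(explanation)                # graph count, graph breaks, ops per graph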

This experiment suggests to me that Dynamo does trigger real kernel execution, but that seems to conflict with the "Dynamo Overview" documentation. Could you help clarify? Thanks.

In your code snippets you are not using pure Dynamo operations; you are executing the model after calling torch.compile on it. As the docs also explain, after Dynamo's processing a backend is used to code-gen and/or execute the kernels, so of course the forward pass of a compiled model will show executed kernels.

Thanks for the reply.

My understanding is that m(inputs) will trigger two steps:

  1. Dynamo converts the Python bytecode to an FX graph, and TORCH_LOGS="+dynamo" shows what that FX graph looks like.
  2. The FX graph is executed. I understand there is real kernel execution at this step.

My question is whether there is real kernel execution in step 1. From the logs of case 1 and case 2, the FX graphs generated by Dynamo (in step 1) are different, which to me implies that there is real kernel execution in step 1; otherwise, we would see the same FX graph for both cases.
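To separate the two steps in my experiment, I can instrument both sides: print from inside a custom backend (invoked when Dynamo hands over a captured graph) and from inside the callable the backend returns (invoked when that graph actually runs). A rough sketch, reusing TestNet from above (the instrumented_backend and run names are just for illustration):

import torch
from typing import List

def instrumented_backend(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
    print("step 1: Dynamo handed a captured graph to the backend")   # compile/capture time
    def run(*args):
        print("step 2: the captured graph is being executed")        # run time
        return gm.forward(*args)
    return run

m2 = torch.compile(TestNet().cuda(), backend=instrumented_backend)
m2(torch.ones(3).cuda())    # compare when the "step 1" vs "step 2" lines appear
m2(torch.ones(3).cuda())    # second call: cached compiled code is reused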