Another simple test case to understand the boundary conditions for compile:
```python
import torch

@torch.compile()
def affine(x, a, b):
    return a * x + b

alst = [3., 4., 5., 6.]
blst = [2., 3., 7., -3.2]
nelems = 1000
x = torch.randn(nelems, dtype=torch.float32, device="cuda")
for a, b in zip(alst, blst):
    affine(x, a, b)
```
My observation is that this code generates four compiled kernels, each one specialized on the constant input values of `a` and `b`. This is in contrast with a dynamically-sized `x`, where you get a first size-specialized kernel, followed by a variable-sized one used for all subsequent invocations.
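One way to see the recompilations directly is with a counting backend (a sketch, not part of the original post; it runs on CPU for portability, and the exact count depends on whether your PyTorch version specializes Python floats):

```python
import torch

compile_count = 0

def counting_backend(gm, example_inputs):
    # Custom Dynamo backend: count how many distinct graphs get
    # compiled, then just run the captured graph eagerly.
    global compile_count
    compile_count += 1
    return gm.forward

@torch.compile(backend=counting_backend)
def affine(x, a, b):
    return a * x + b

x = torch.randn(1000)  # CPU here for illustration; the original uses CUDA
results = []
for a, b in zip([3., 4., 5., 6.], [2., 3., 7., -3.2]):
    results.append(affine(x, a, b))

# On versions that specialize Python float arguments, this prints 4
# (one graph per (a, b) pair); newer releases may treat floats
# dynamically and compile fewer graphs.
print(compile_count)
```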
It’s possible to avoid recompilation for every new `a`/`b` by making those into CUDA tensors and `copy_`ing the values in, but that’s pretty kludgy. I’m guessing this is a Dynamo issue: is it possible to get it to generate a graph that takes `a`/`b` as variable scalar inputs? I understand that it can’t in the case of CUDA graph execution, but for eager execution this is essentially the same thing as dynamic shapes.
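For reference, the tensor workaround looks roughly like this (a sketch on CPU, reusing a hypothetical counting backend to show that only one graph is compiled; `fill_` is used here as a stand-in for `copy_`ing the value in):

```python
import torch

compile_count = 0

def counting_backend(gm, example_inputs):
    # Count graph compilations, then run the graph eagerly.
    global compile_count
    compile_count += 1
    return gm.forward

@torch.compile(backend=counting_backend)
def affine(x, a, b):
    return a * x + b

x = torch.randn(1000)  # CPU here for illustration; the original uses CUDA
# 0-dim tensors stand in for the Python scalars, so their *values*
# become runtime data instead of compile-time constants.
a_t = torch.tensor(0.0)
b_t = torch.tensor(0.0)

results = []
for a, b in zip([3., 4., 5., 6.], [2., 3., 7., -3.2]):
    a_t.fill_(a)  # equivalent to a_t.copy_(torch.tensor(a))
    b_t.fill_(b)
    results.append(affine(x, a_t, b_t))

print(compile_count)  # → 1: Dynamo guards on shape/dtype, not tensor values
```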