Another simple test case to understand the boundary conditions of torch.compile:
import torch

@torch.compile()
def affine(x, a, b):
    return a * x + b

alst = [3., 4., 5., 6.]
blst = [2., 3., 7., -3.2]
nelems = 1000
x = torch.randn(nelems, dtype=torch.float32, device="cuda")
for a, b in zip(alst, blst):
    affine(x, a, b)
My observation is that this code will generate 4 compiled kernels, each one specialized on the constant input values for a and b. This is in contrast with a dynamically-sized x, in which case you get the first size-specialized kernel, followed by a variable-sized one used for subsequent invocations.
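One way I check this kind of behavior is Dynamo's recompile logging; the recompiles logging knob is my assumption about a recent PyTorch 2.x build, not something from the snippet above:

import torch

# Print a message (with the failed guard) each time Dynamo recompiles;
# assumes a recent PyTorch 2.x build where this logging knob exists.
torch._logging.set_logs(recompiles=True)

@torch.compile()
def affine(x, a, b):
    return a * x + b

x = torch.randn(1000, dtype=torch.float32, device="cuda")
for a, b in zip([3., 4., 5., 6.], [2., 3., 7., -3.2]):
    affine(x, a, b)  # expect a recompile message for each new (a, b) pair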
It’s possible to avoid recompilation for every new a/b by making those into CUDA tensors and copy_ing the value in, but that’s pretty kludgey. I’m guessing this is a Dynamo issue: is it possible to get it to generate a variable-scalar-input graph for a and b? I understand that it can’t in the case of CUDA graph execution, but for eager execution, this is basically the same thing as dynamic shapes.