Passing Python scalars to compiled functions

Here is another simple test case for understanding the boundary conditions of torch.compile:

import torch

@torch.compile
def affine(x, a, b):
    return a * x + b

alst = [3., 4., 5., 6.]
blst = [2., 3., 7., -3.2]
nelems = 1000
x = torch.randn(nelems, dtype=torch.float32, device="cuda")
for a, b in zip(alst, blst):
    affine(x, a, b)

My observation is that this code generates 4 compiled kernels, each one specialized on the constant input values of a and b. This is in contrast with a dynamically-sized x, where you get one size-specialized kernel for the first call, followed by a variable-sized kernel used for subsequent invocations.
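One way to check the recompilation behavior directly is to count graph compilations with a custom backend (a sketch; the counting-backend idea uses the standard torch.compile custom-backend hook, but the names here are mine, and the exact count can vary across PyTorch versions depending on how Python floats are specialized):

```python
import torch

compile_count = 0

def counting_backend(gm, example_inputs):
    # Invoked once per (re)compilation; run the graph unchanged.
    global compile_count
    compile_count += 1
    return gm.forward

@torch.compile(backend=counting_backend)
def affine(x, a, b):
    return a * x + b

x = torch.randn(1000)
for a, b in zip([3., 4., 5., 6.], [2., 3., 7., -3.2]):
    affine(x, a, b)

# On versions that specialize Python floats, this reports one
# compilation per distinct (a, b) pair.
print(compile_count)
```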

It’s possible to avoid recomputation for every new a/b by making those into cuda tensors and copy_ing the value in, but that’s pretty kludgey. I’m guessing this is a dynamo issue – is it possible to get it to generate a variable-scalar-input graph for a and b? I understand that it can’t in the case of cuda graph execution, but for eager execution, this is basically the same thing as dynamic shapes.