Triton Error [CUDA]: an illegal memory access was encountered

I’m trying to leverage torch.compile(model) in a largish existing application, AUTOMATIC1111 (Stable Diffusion inference). When run with a compiled model I get:
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
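For reference, the compile call itself is just the stock API. A minimal sketch of how the model gets wrapped (the module and shapes below are placeholders, not the actual AUTOMATIC1111 code):

import torch

# Placeholder module standing in for the real model being compiled.
model = torch.nn.Linear(768, 320).cuda().half()
compiled = torch.compile(model)

# The first call triggers Inductor codegen and the Triton autotuning
# benchmarks described below; the crash happens while running the
# compiled model.
x = torch.randn(16, 768, device="cuda", dtype=torch.float16)
out = compiled(x)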
Through instrumentation I’ve found that it appears to create on-the-fly Python benchmark programs that are run to determine the best way to do something. The function generated in my /tmp dir is:

@triton.jit
def triton_mm(in_ptr0, arg_A, arg_B, out_ptr0):
    <body deleted for brevity>
    tl.store(out_ptr0 + (xindex + tl.zeros(mask.shape, tl.int32)), tmp1, mask)

My guess is that this last line of the function is writing outside the bounds of the output variable. If I try to print the five variables used in tl.store() I get:

out_ptr0   pointer<fp16>[constexpr[1]]
xindex     int32[constexpr[32],constexpr[32]]
mask.shape [constexpr[32], constexpr[32]]
tmp1       fp32[constexpr[32],constexpr[32]]
mask       int1[constexpr[32],constexpr[32]]

I’m still new to Python. What are these? I want to see the actual values so I can check whether the index offset is crazy. What was printed looks like types instead of values. I tried to print “out_ptr0[0]” and got a syntax error. Is it an array?
NOTE:
print(f"xxx = {xxx}") doesn’t work giving an
NotImplementedError: Unsupported node: JoinedStr
error, as if this is some restricted Python that can only print simple strings and values. Also print(type(out_ptr0)) doesn’t work.

I discovered that Triton doesn’t “support printing a variable”.
Really, I’m not joking. There’s an alternative syntax to print some things, but it doesn’t always work, and if you want to print something like type(var) or var.size() you can forget about it.
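For the record, that alternative syntax is a device-side print call inside the @triton.jit function. A minimal sketch (the workaround I was given used tl.printf(); newer Triton releases spell it tl.device_print, and either way it only takes a prefix string plus values, nothing like type(var)):

import torch
import triton
import triton.language as tl

@triton.jit
def debug_kernel(out_ptr0, xnumel, XBLOCK: tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)
    mask = xindex < xnumel
    # Device-side print: a literal prefix followed by the tensor of indices.
    tl.device_print("xindex", xindex)
    tl.store(out_ptr0 + xindex, xindex.to(tl.float32), mask)

out = torch.empty(64, device="cuda", dtype=torch.float32)
debug_kernel[(1,)](out, 64, XBLOCK=64)
torch.cuda.synchronize()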

But that just makes it more of a challenge.
Flying blind, but I did stumble on select_algorithm.py’s call->run, which I believe is what actually invokes the external JIT’ed triton_mm function. This is REAL Python code, so I can instrument it without issues.
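The instrumentation there is nothing fancy — just ordinary prints of whatever is about to be passed to the kernel. A sketch of the kind of helper I mean (dump_args is my own name, and exactly where it gets called inside select_algorithm.py depends on the PyTorch version):

import torch

def dump_args(*args):
    # Called from the real (non-JIT) Python code around the kernel launch
    # to record exactly what is being handed to triton_mm.
    for i, a in enumerate(args):
        if isinstance(a, torch.Tensor):
            print(i, tuple(a.shape), a.stride(), a.dtype, a.device, hex(a.data_ptr()))
        else:
            print(i, type(a).__name__, a)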

Cross-post from here with a follow-up.
Could you post a minimal and executable code snippet to debug the issue or grab the stacktrace from cuda-gdb, please?

Cross-post from here with a follow-up.

Is that a comment or a request? What I put “there” is off topic for the codegen issue. After I posted there, I decided this belonged as a new issue. All the info from there is now here. Are you asking me to make a note there that this discussion is now here? Sorry for being dense.

Could you post a minimal and executable code snippet to debug the issue

Simplifying 100 thousand lines of code I didn’t write isn’t easy.
I did figure out how to create a stand-alone call to the JIT’ed triton_mm kernel matching the real one that crashes, but when I run it there isn’t a problem.
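The harness follows this pattern: allocate torch tensors of the same shapes and dtypes and launch the kernel directly. A sketch with a trivial stand-in kernel (the real test pastes in the generated triton_mm body, which I omitted above; grid and block sizes here are placeholders):

import torch
import triton
import triton.language as tl

@triton.jit
def toy_kernel(in_ptr0, out_ptr0, xnumel, XBLOCK: tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)
    mask = xindex < xnumel
    tmp0 = tl.load(in_ptr0 + xindex, mask)
    tmp1 = tmp0.to(tl.float32)
    # Adding a big constant offset to out_ptr0 here is how I reproduce the
    # same illegal-memory-access error on purpose in this small test.
    tl.store(out_ptr0 + xindex, tmp1, mask)

xnumel = 1024
a = torch.randn(xnumel, device="cuda", dtype=torch.float16)
out = torch.empty(xnumel, device="cuda", dtype=torch.float32)
grid = (triton.cdiv(xnumel, 256),)
toy_kernel[grid](a, out, xnumel, XBLOCK=256)
torch.cuda.synchronize()  # surfaces any asynchronous CUDA error right here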

I’ll see if cuda-gdb catches the error at the first point it occurs so I can get a stack.
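Separately, a cheap way to make sure the failure is reported on the launch that actually caused it (rather than on some later call) is forcing synchronous launches. A sketch; it has to take effect before CUDA is initialized:

import os

# Must be set before the first CUDA call in the process; kernels then launch
# synchronously, so the failing kernel is the one that raises the error.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the env var, before any CUDA work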

It’s a comment so that other users can track the previous discussion and avoid re-debugging potentially already discussed issues.

Yes, creating a minimal code snippet is often the challenging part, but let me know in case you are able to narrow down the issue in the 100-thousand-line repo.

If Triton had basic printf-style debugging, the oldest kind there is, I could have figured this out hours ago. I’m 98% sure this is writing outside the out_ptr0 boundaries, and xindex likely has bad values. But the tl.printf() workaround the Triton folks gave me today doesn’t work. In fact, I can get the same corruption error in my stand-alone test by just adding a big offset to out_ptr0 in tl.store(), but that isn’t a real repro.

However, here you go:

CUDA Exception: Warp Out-of-range Address
The exception was triggered at PC 0x7f88d360eb90
Thread 91 "python3" received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.
[Switching focus to CUDA kernel 0, grid 22591, block (0,0,0), thread (128,0,0), device 0, sm 0, warp 4, lane 0]
0x00007f88d360ebb0 in triton_mm_0d1d2d3d<<<(40,1,1),(256,1,1)>>> ()

There is just the one line in the stack trace. x/i $pc gives me:

    HMMA.16816.F32 R4, R12, R8, R4 
R4 = 0x411a7319
R8 = 0x3943bb2d
R12 = 0x305439af

Frustrating, because I can dump the array of xindex values in my stand-alone test case, but in the real torch.compile run the printf hack doesn’t print anything.

@ptrblck Should I open an issue on the PyTorch GitHub, even though I don’t yet have a simple repro? I think I could debug this to root cause with a little help there. I have a lot of context that someone might recognize, given that I can print the inputs to triton_mm and I have the generated source. Currently my stand-alone attempt uses torch random data of the same shapes as I’ve found being passed in the REAL caller, but I’m not passing the exact same Tensors. If I knew how to export a Tensor from the big app, I could use them in my small test.
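(Maybe torch.save()/torch.load() on the exact tensors is all that’s needed for that — a sketch, with an arbitrary file path, assuming I can get at arg_A/arg_B right where the real caller has them in scope:)

import torch

# In the big app, right where the real inputs to triton_mm are visible:
torch.save({"A": arg_A, "B": arg_B}, "/tmp/triton_mm_inputs.pt")

# In the stand-alone test, load the exact same tensors back:
blobs = torch.load("/tmp/triton_mm_inputs.pt")
arg_A, arg_B = blobs["A"].cuda(), blobs["B"].cuda()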

Or should this be a discussion on the Triton GitHub???

But I’ll keep it here if you’d like me to wait till I have a repro.

Yes, I think creating a GitHub issue in the PyTorch repo is a good idea (please tag me there or post the link here once it’s done), as we could continue debugging it there.
Also, I agree with your feedback on the bad debugging experience, but unfortunately I’m not familiar enough with the Triton stack yet to be able to point to an easy way of isolating these memory violations.

Will do tomorrow. Need to sleep. Thanks.


@aifartist there is a workaround to print stuff in Triton, and here’s a relevant issue if you’d like something better: How to debug kernels · Issue #517 · openai/triton · GitHub