CUDA device-side assert when using `detach()` in a recursive loop

I am training a neural network that recursively applies a learned transformation to its input.
Since I do not want backpropagation to propagate all the way back through the recursive applications (for performance reasons), I used something like this:

```python
x = input
for i in range(applications):
    output = transform(x)
    # cut the graph so gradients do not flow through earlier applications
    x = output.detach()
```
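For reference, here is a minimal self-contained sketch of roughly what my training step looks like; the `Transform` stand-in, the loss, and the optimizer below are simplified placeholders for my actual setup:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Simplified stand-in for my actual model.
transform = nn.Sequential(nn.Linear(16, 16), nn.ReLU()).to(device)
optimizer = torch.optim.Adam(transform.parameters(), lr=1e-3)

applications = 5

def training_step(batch, target):
    x = batch
    for i in range(applications):
        output = transform(x)
        # Cut the graph here so backprop only covers the most recent
        # application instead of the whole recursive chain.
        x = output.detach()

    loss = nn.functional.mse_loss(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with random data on the GPU (where the assert is triggered).
loss = training_step(torch.randn(8, 16, device=device),
                     torch.randn(8, 16, device=device))
```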

Using this results in an error in CUDA:

```
cudaEventSynchronize in future::wait: device-side assert triggered
```

When I set the environment variable CUDA_LAUNCH_BLOCKING=1 (as advised here), the error changes to:

```
after cudaLaunch in triple_chevron_launcher::launch(): device-side assert triggered
```
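For completeness, this is roughly how I enable it: it is an environment variable rather than a command-line flag, and as far as I understand it has to be set before CUDA is initialized, so I set it before importing torch (or on the shell command line as `CUDA_LAUNCH_BLOCKING=1 python train.py`):

```python
import os
# Must be set before torch initializes CUDA so kernel launches run synchronously.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
```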

Without the detach, no error occurs. The error does not appear at a fixed point: the network can train successfully for many batches before it is triggered.

Hmm, weird. I don't even know what triple_chevron_launcher is, and it's strange that even with CUDA_LAUNCH_BLOCKING=1 it's not giving a full stack trace.