I have a CUDA kernel for raycasting in which one dimension of the grid compares a newly computed minimum against the “current minimum” value and replaces it if smaller.
The kernel is blazing fast and I’m really happy it works, but I realized there is a race condition in that compare-and-replace step, and it is visible in the graphical output. Does PyTorch expose any atomic operations I could use on my PackedAccessor32 object?
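To illustrate what I mean, here is a rough sketch (names are illustrative, not my actual kernel). The race is the classic non-atomic read-compare-write: two threads can both pass the `if` before either writes, so the larger value can land last. The workaround I've seen in plain CUDA is a compare-and-swap loop on the bit pattern, since `atomicMin` has no `float` overload — but I'd prefer something PyTorch-side if it exists:

```cuda
#include <cuda_runtime.h>

// Racy version (the pattern my kernel currently uses):
//   if (d_new < depth[idx]) depth[idx] = d_new;   // read and write not atomic
//
// CAS-based atomic float-min workaround. Comparisons happen on the float
// value; only the swap itself operates on the raw 32-bit pattern.
__device__ void atomicMinFloat(float* addr, float value) {
    unsigned int* bits = reinterpret_cast<unsigned int*>(addr);
    unsigned int old = *bits;
    while (__uint_as_float(old) > value) {
        unsigned int assumed = old;
        // Swap in our value only if the cell still holds what we last read.
        old = atomicCAS(bits, assumed, __float_as_uint(value));
        if (old == assumed) break;  // our write took effect
        // Otherwise another thread won the race; re-check against its value.
    }
}

__global__ void update_min(float* depth, const float* candidates, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicMinFloat(&depth[0], candidates[i]);  // all threads contend on one cell
    }
}
```

As far as I can tell, `&acc[y][x]` on a packed accessor yields a plain `float*`, so a helper like this could be dropped in, but I'm hoping there is a built-in alternative.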