Non-deterministic behavior in Custom CUDA extension

Hey there!

Found a strange reproducibility issue in a custom CUDA extension.
[The forward pass and the backward pass both work in principle; they were validated against a slow, numerical torch implementation of our custom layer.]
The issues are:

  • The backward pass for the weights only gives reliably correct results for batch_size = 1. For larger batch sizes it produces the correct numerical results only sporadically.

  • The deviation from the correct result is always a multiple of the corresponding batch_size = 1 result. My guess: the CUDA kernel is accessing the same memory locations from multiple threads. Is there an issue with my thread/block indexing?

  • Most notably, the behavior is non-deterministic: executing the code repeatedly with the same inputs yields different results. Again, the shifts are multiples of the batch_size = 1 result for the respective index of d_weights.

Here is the relevant backward pass code. The full code is more involved, in particular the backward pass for the inputs; however, the d_weights returned by this function are not modified anywhere else.

// batch index
const int n = blockIdx.y;
// column index
const int c = blockIdx.x * blockDim.x + threadIdx.x;
const int state_size = input.size(1);
const int batch_size = input.size(0);
if (c < state_size) {
    if (n < batch_size) {
        for (int i = 3 * c * state_size; i < 3 * (c + 1) * state_size; i += 1) {
            d_weights[i] += (input[n][idxs[0][i]] - input[n][idxs[1][i]]) * grad_out[n][c];
        }
    }
}

The only workaround I can think of relies on extending the tensors by another dimension of length state_size, which is exactly what I want to avoid, as it will cause memory shortages.

The solution is to use atomicAdd. The loop range depends only on c, so for every batch index n a different thread performs a read-modify-write on the same d_weights[i]; with a plain += these updates race and some are lost. atomicAdd serializes them:

atomicAdd(&d_weights[i], (input[n][idxs[0][i]]-input[n][idxs[1][i]])*grad_out[n][c]);
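For completeness, here is how the fixed kernel might look as a self-contained sketch. The kernel name, accessor types, and the assumed launch configuration (grid.x covering state_size, grid.y = batch_size) are assumptions filled in from the snippet above, not the original code:

```cuda
#include <torch/extension.h>

template <typename scalar_t>
__global__ void custom_backward_weights_kernel(
    const torch::PackedTensorAccessor32<scalar_t, 2> input,     // [batch, state]
    const torch::PackedTensorAccessor32<scalar_t, 2> grad_out,  // [batch, state]
    const torch::PackedTensorAccessor32<int64_t, 2> idxs,       // [2, 3*state^2]
    torch::PackedTensorAccessor32<scalar_t, 1> d_weights) {     // [3*state^2]
    // batch index (one block row per sample)
    const int n = blockIdx.y;
    // column index
    const int c = blockIdx.x * blockDim.x + threadIdx.x;
    const int batch_size = input.size(0);
    const int state_size = input.size(1);
    if (n < batch_size && c < state_size) {
        for (int i = 3 * c * state_size; i < 3 * (c + 1) * state_size; i += 1) {
            // Threads with the same c but different n all target the same
            // d_weights[i]; atomicAdd makes the accumulation race-free.
            atomicAdd(&d_weights[i],
                      (input[n][idxs[0][i]] - input[n][idxs[1][i]]) * grad_out[n][c]);
        }
    }
}
```

One caveat: atomicAdd makes the result numerically correct, but the order in which the floating-point additions happen still varies between runs, so bitwise-identical results across runs are not guaranteed; the remaining differences are at the level of floating-point rounding, not the large multiples described above.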