GPU inference goes way off after the first iteration; possibly a tensor.transpose() async issue

I have a complex vision model. Inference on CPU works fine, but inference on GPU only does well on the first pass (image); all following passes (images) go way off and produce garbage results. The problem is not in the data: when I drop the first image (which was predicted well), the new first image (previously the second, which was predicted badly) is now predicted well. After a couple of days of debugging I have pinpointed the single line of code where the divergence from normal execution first occurs, but I cannot figure out why, except that it perhaps has something to do with asynchronous execution on the GPU, and possibly with the tensor.transpose() used on this line. I have tested it on several GPUs, all with the same result. I am using PyTorch 1.2, Python 3.6 and CUDA 10.

Here is the line of code that first diverges:

    # real size: torch.Size([1, 128, 10, 400, 352])
    dense = data.new_empty((batch_size, 128, grid_size[1], grid_size[2], grid_size[0]))
    # copy data from the sparse tensor to the new tensor
    # THIS IS WHERE THE OUTPUT DIFFERS ON GPU FROM THE 2nd PASS ONWARD
    # dense (CPU) == dense (GPU) on the first pass, but they differ on each following pass
    dense[:, :, coords[:, :, 1], coords[:, :, 2], coords[:, :, 0]] = data.transpose(0, 2)

The shape of coords is [1, N, 3] and the shape of data is [N, 1, 128], where N is different for each image (pass). coords is used to copy data to the correct locations in the new tensor. Could the variable size of N be a problem?
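For context, here is a minimal sketch of the shapes involved (N below is illustrative since it varies per image; grid_size matches the sizes from the comment above):

    import torch

    # N is illustrative; grid_size matches the real dense size [1, 128, 10, 400, 352]
    device = "cuda" if torch.cuda.is_available() else "cpu"
    batch_size, N = 1, 5000
    grid_size = (352, 10, 400)

    data = torch.randn(N, 1, 128, device=device)                  # [N, 1, 128]
    coords = torch.stack(
        [torch.randint(0, g, (1, N), device=device) for g in grid_size],
        dim=2,
    )                                                             # [1, N, 3]

    # same construction as in the model: dense is [1, 128, 10, 400, 352]
    dense = data.new_empty((batch_size, 128, grid_size[1], grid_size[2], grid_size[0]))
    dense[:, :, coords[:, :, 1], coords[:, :, 2], coords[:, :, 0]] = data.transpose(0, 2)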

I have tried copy_()-ing the data tensor to an intermediate variable before building dense, but no luck. The content of the dense tensor is the same on CPU and GPU for the first image, but from the second image onward it diverges: on CPU the model predicts well, while on GPU the results are different (out of 180M values, 155K differed even with a very loose 1e-04 tolerance). This does not even involve any learning, just tensor creation and transpose().
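The CPU/GPU comparison can be done along these lines (a sketch, not my exact code; dense_cpu and dense_gpu stand for the dense tensors saved from the two runs on the same image):

    # count elements that differ beyond a loose absolute tolerance
    mismatch = ~torch.isclose(dense_gpu.cpu(), dense_cpu, atol=1e-4)
    print(mismatch.sum().item(), "of", mismatch.numel(), "values differ")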

I have of course called the model in eval() mode, and I even checked that dense.requires_grad == False on each iteration.
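The inference setup looks roughly like this (a sketch with placeholder names; the requires_grad check happens inside the forward pass where dense is built, and torch.no_grad() is shown here only for illustration):

    model.eval()                        # `model` and `images` are placeholder names
    with torch.no_grad():               # no autograd graph is built during inference
        for image in images:
            prediction = model(image)   # dense.requires_grad is checked inside forward()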

Well, I am totally out of ideas about where to look next. I might work around it somehow (e.g. initialise a new model each time), but I suspect this might also affect training, and I would like to understand the cause.

OK, I have now solved the issue in the sense that I know what caused it and have fixed it. I created the new tensor using .new_empty(), which worked fine on the CPU where I developed the code, and also on the first iteration on the GPU, but this was the source of the problem. When I changed the code to use .new_zeros() instead, the problem disappeared and inference is fine on both CPU and GPU.
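Concretely, the fix was a one-line change to the line shown above:

    # before: .new_empty() returns an uninitialised buffer, so any element the
    # scatter assignment does not overwrite keeps whatever was already in that memory
    # dense = data.new_empty((batch_size, 128, grid_size[1], grid_size[2], grid_size[0]))

    # after: .new_zeros() fills the buffer with zeros before the scatter assignment
    dense = data.new_zeros((batch_size, 128, grid_size[1], grid_size[2], grid_size[0]))
    dense[:, :, coords[:, :, 1], coords[:, :, 2], coords[:, :, 0]] = data.transpose(0, 2)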

If someone can explain why this happens internally, I would be grateful.

PS: If anyone wonders how I located the problematic lines of code, I described the process in this comment.