I have a complex vision model. Inference on CPU works fine, but on GPU only the first pass (image) is predicted well; every following pass (image) goes way off and produces garbage results. The problem is not in the data: when I drop the first image, the new first image (previously the second, which was predicted badly) is predicted well. After a couple of days of debugging I have pinpointed the single line of code where the GPU run first diverges from normal execution. I cannot figure out why, except that it may have something to do with asynchronous execution on the GPU, and possibly with the tensor.transpose() used on this line. I have tested on several GPUs and the behaviour is the same on all of them. I am using PyTorch 1.2, Python 3.6 and CUDA 10.
Here is the line of code that first diverges:
# real size: torch.Size([1, 128, 10, 400, 352])
dense = data.new_empty((batch_size, 128, grid_size[1], grid_size[2], grid_size[0]))
# copy data from sparse tensor to new tensor
# THIS IS WHERE THE OUTPUT IS DIFFERENT ON GPU from the 2nd pass onward
# dense (CPU) == dense (GPU) on first pass, but differ on each following pass
dense[:, :, coords[:,:,1], coords[:,:,2], coords[:,:,0]] = data.transpose(0,2)
The shape of coords is [1, N, 3] and the shape of data is [N, 1, 128], where N is different for each image (pass). coords is used to copy each row of data to its correct location in the new tensor. Could the varying size of N be a problem?
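To make the shapes concrete, here is a minimal CPU-only sketch of the same indexing pattern with toy sizes. The sizes, the random coordinates and the per-row check are assumptions made up for illustration, not taken from the model. It only verifies the cells that the assignment actually writes.

```python
import torch

torch.manual_seed(0)
batch_size, channels, n = 1, 4, 6
grid_size = (5, 2, 3)  # toy extents (assumed), not the real ones

# Pick n distinct cells so that no location is written twice.
cells = torch.ones(grid_size[1], grid_size[2], grid_size[0]).nonzero()  # [30, 3]
picked = cells[torch.randperm(cells.shape[0])[:n]]
# Reorder columns so coords[..., 0] addresses the last axis of dense,
# matching the index order used in the original line.
coords = picked[:, [2, 0, 1]].unsqueeze(0)   # [1, N, 3]
data = torch.randn(n, 1, channels)           # [N, 1, C]

dense = data.new_empty((batch_size, channels, grid_size[1], grid_size[2], grid_size[0]))
# The three [1, N] index tensors select a [B, C, 1, N] view;
# data.transpose(0, 2) has shape [C, 1, N] and broadcasts onto it.
dense[:, :, coords[:, :, 1], coords[:, :, 2], coords[:, :, 0]] = data.transpose(0, 2)

# Check every written cell against the corresponding row of data.
rows_match = all(
    torch.equal(dense[0, :, c1, c2, c0], data[i, 0])
    for i, (c0, c1, c2) in enumerate(coords[0].tolist())
)
print(dense.shape, rows_match)
```

With these shapes the assignment fills N * 128 values in the real tensor; every other cell keeps whatever new_empty() returned.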
I have tried to copy_() the data tensor into an intermediate variable before building dense, but no luck. The content of dense is the same on CPU and GPU for the first image. From the second image onward it diverges: the CPU run predicts well, but the GPU run gets different values (out of 180M values, 155K differed even with a very loose tolerance of 1e-04). This does not even involve any learning, just tensor creation and transpose().
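One check that might narrow this down is to see whether the differing values sit in cells the assignment wrote, or in cells it never touched. new_empty() returns uninitialised memory, so untouched cells can hold arbitrary values. The sketch below is only an illustration: split_mismatches is a hypothetical helper, and the two buffers with different fill values merely stand in for the CPU and GPU results.

```python
import torch

def split_mismatches(dense_a, dense_b, coords, atol=1e-4):
    """Count differing values at written and at never-written positions."""
    coords = coords.cpu()
    written = torch.zeros(dense_a.shape, dtype=torch.bool)
    # Same index pattern as the assignment in the model code.
    written[:, :, coords[:, :, 1], coords[:, :, 2], coords[:, :, 0]] = True
    # NaNs never compare as close, so they are counted as differences.
    differs = ~torch.isclose(dense_a.cpu(), dense_b.cpu(), rtol=0.0, atol=atol)
    return int((differs & written).sum()), int((differs & ~written).sum())

# Toy stand-in data (assumed sizes): n distinct cells in a 2 x 3 x 5 grid.
torch.manual_seed(0)
n, channels = 4, 3
cells = torch.ones(2, 3, 5).nonzero()
coords = cells[torch.randperm(cells.shape[0])[:n]][:, [2, 0, 1]].unsqueeze(0)
data = torch.randn(n, 1, channels)

shape = (1, channels, 2, 3, 5)
dense_a = data.new_full(shape, 0.0)  # pretend leftover memory was zeros
dense_b = data.new_full(shape, 7.0)  # pretend leftover memory was something else
for dense in (dense_a, dense_b):
    dense[:, :, coords[:, :, 1], coords[:, :, 2], coords[:, :, 0]] = data.transpose(0, 2)

inside, outside = split_mismatches(dense_a, dense_b, coords)
print(inside, outside)
```

If all mismatches on the real tensors fall outside the written cells, the indexing and transpose() are probably not the culprit, and allocating with data.new_zeros() instead of data.new_empty() would be the next thing to try.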
I have of course put the model in eval() mode, and I even checked on each iteration that dense.requires_grad == False.
Well, I am totally out of ideas where to look next. I might work around it somehow (initialise a new model each time), but I suspect that this might also affect training and I’d like to understand the cause.