OK, I have solved the issue in the sense that I know what caused it and have fixed it. I created a new tensor using .new_empty(), which worked fine on the CPU where I developed the code and also on the first iteration on the GPU. But this was the source of the problem. When I changed the code to use .new_zeros() instead, the problem disappeared and inference now works on both CPU and GPU.
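For anyone hitting the same thing, here is a minimal sketch of the difference between the two calls as I understand it from the documentation: .new_empty() returns a tensor with the same dtype and device as the source but with uninitialized contents, while .new_zeros() fills every element with 0. The tensor names and shapes below are made up for illustration and are not from my actual model.

```python
import torch

# Illustrative source tensor; its shape and values are assumptions for this sketch.
x = torch.ones(2, 3)

# Same dtype and device as x, but the contents are uninitialized:
# whatever bytes were already in that memory. Do not read before writing.
uninit = x.new_empty((2, 3))

# Same dtype and device as x, with every element set to 0.
zeros = x.new_zeros((2, 3))

# new_empty() is only safe if every element is written before it is read,
# for example by copying into the whole tensor.
uninit.copy_(x)

print(zeros)   # all zeros
print(uninit)  # equal to x, because every element was overwritten above
```

My guess, and it is only a guess, is that my code read some elements it never wrote, and that the uninitialized memory simply happened to contain zeros on the CPU and on the first GPU iteration.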
If someone can explain why this happens internally, I would be grateful.
PS: If anyone wonders how I located the problematic lines of code, I described it in this comment.