GPU inference goes way off after first iteration; possibly tensor/transpose() async issue

OK, I have now solved the issue, in the sense that I know what caused it and have fixed it. I created a new tensor using .new_empty(), which worked fine on CPU (where I developed the code) and also on the first iteration on GPU. But this turned out to be the source of the problem: when I changed the code to use .new_zeros() instead, the problem disappeared and inference is now correct on both CPU and GPU.
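For illustration, here is a minimal sketch (with made-up buffer shapes, not my actual code) contrasting the two calls. .new_empty() allocates memory without initializing it, so any element the code never explicitly writes keeps whatever bytes happened to be in that memory, while .new_zeros() gives a defined starting value:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
src = torch.randn(2, 3, device=device)

# .new_empty(): contents are undefined until written to.
# .new_zeros(): contents are guaranteed to be 0.
buf_empty = src.new_empty(2, 5)
buf_zeros = src.new_zeros(2, 5)

# If only part of the buffer is filled in, the remainder stays
# undefined with new_empty() but is zero with new_zeros().
buf_empty[:, :3] = src
buf_zeros[:, :3] = src

print(buf_empty)  # last two columns may hold arbitrary leftover values
print(buf_zeros)  # last two columns are 0
```

On CPU (or on a freshly allocated GPU buffer) the uninitialized memory can happen to contain zeros, which would hide a bug like this until later iterations reuse memory holding stale values.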

If someone can explain why this happens internally, I would be grateful.

PS: If anyone wonders how I located the problematic code lines, I described it in this comment.