How to print values within a __device__ function?

Printing from a __device__ or __global__ GPU function is awkward. We can use printf("%d", a); for a single value, but this cannot print a torch.tensor directly, because a tensor is not one single value. So for now I am using this code to print it out:
for (int i = 0; i < new_cell.size(0); i++)
    for (int j = 0; j < new_cell.size(1); j++)
        printf("%f\n", new_cell[i][j]);
Seems crazy… Any better suggestions?
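
For reference, here is a minimal sketch of the same loop with a guard so that only one thread prints; without it, every thread that reaches the loop dumps the whole matrix. This is a standalone illustration, not the PyTorch extension itself: the names print_matrix_kernel and print_matrix_demo, the raw float pointer, and the 2x3 test data are all assumptions standing in for new_cell.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: `data`, `rows`, and `cols` stand in for the
// accessor `new_cell` from the question.
__global__ void print_matrix_kernel(const float* data, int rows, int cols) {
    // Let only one thread of the whole grid print; otherwise every
    // thread would print the full matrix.
    if (blockIdx.x != 0 || threadIdx.x != 0) return;
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            printf("%8.4f ", data[i * cols + j]);  // float is promoted to double
        }
        printf("\n");  // one matrix row per output line
    }
}

// Hypothetical host helper: copies a small matrix to the GPU, launches the
// kernel, and synchronizes so the device-side printf buffer is flushed.
cudaError_t print_matrix_demo() {
    const int rows = 2, cols = 3;
    const float host[rows * cols] = {1.f, 2.f, 3.f, 4.f, 5.f, 6.f};
    float* dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, sizeof(host));
    if (err != cudaSuccess) return err;
    err = cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        print_matrix_kernel<<<1, 32>>>(dev, rows, cols);
        err = cudaGetLastError();          // catches launch errors
        if (err == cudaSuccess) {
            err = cudaDeviceSynchronize(); // flushes device printf output
        }
    }
    cudaFree(dev);
    return err;
}
```

Device-side printf output is buffered and only shows up at a synchronization point, so when the kernel is launched from Python the text may not appear until something like torch.cuda.synchronize() has run.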

I also noticed a CUDA debugging tool, Nsight (NVIDIA Nsight Integration | NVIDIA Developer), but I am not sure how to use it on a .cu file with PyTorch, because here we are actually calling a .py file to run the compiled .cu file…?