Inside a __device__ or __global__ function on the GPU, it is hard to print anything. We can use printf("%d", a); but that cannot print a torch tensor, because a tensor is not a single value. So for now I am using this code to print it:
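for (int i = 0; i < new_cell.size(0); i++)
    for (int j = 0; j < new_cell.size(1); j++)
        printf("%f ", (double)new_cell[i][j]);  // loop body assumed; it is not shown in the post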
Seems crazy… Any better suggestions?
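One possible way to tidy this up, as a minimal sketch rather than an official PyTorch facility: wrap the loops in a small helper and let only one thread of the grid call it, so the output is printed once and is not interleaved. Everything named here (`print_matrix`, `debug_kernel`, `HOST_DEVICE`) is hypothetical, and the sketch assumes a plain row-major float buffer instead of a tensor accessor; with an accessor like `new_cell` the indexing would be `new_cell[i][j]` instead.

```cpp
#include <cstdio>

// Lets the same helper compile with nvcc (host + device) or a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: prints a row-major rows x cols float buffer, one row per line.
// Returns the number of elements printed.
HOST_DEVICE inline int print_matrix(const float* data, int rows, int cols) {
    int printed = 0;
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            printf("%8.4f ", static_cast<double>(data[i * cols + j]));
            ++printed;
        }
        printf("\n");
    }
    return printed;
}

#ifdef __CUDACC__
// Only thread 0 of block 0 prints; every other thread skips the call.
__global__ void debug_kernel(const float* data, int rows, int cols) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        print_matrix(data, rows, cols);
    }
}

int main() {
    const int rows = 2, cols = 3;
    const float host[rows * cols] = {1, 2, 3, 4, 5, 6};
    float* dev = nullptr;
    cudaMalloc(&dev, sizeof(host));
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);
    debug_kernel<<<2, 32>>>(dev, rows, cols);
    cudaDeviceSynchronize();  // wait for the kernel so the device printf buffer is flushed
    cudaFree(dev);
    return 0;
}
#endif
```

The single-thread guard is the main point: without it, every thread that reaches the loops prints the whole matrix. If the values are only needed after the kernel finishes, printing the tensor from the Python side is usually simpler than printing on the device.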
I also noticed a CUDA debugging tool, Nsight (NVIDIA Nsight Integration | NVIDIA Developer), but I am not sure how to use it with a .cu file in PyTorch… because here we are actually running a .py file that calls the compiled .cu code…?