The difference of ~1e-6 is within the numerical accuracy of floating-point operations relative to the size of the result, so a difference of this magnitude is "normal operations".
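To put a number on "relative to the size of the result", here is a rough sketch that computes the gap between adjacent float32 values at a few magnitudes. It uses only the standard library rather than PyTorch. The helper name `f32_spacing` and the sample magnitudes are my own choices for illustration, not values from the original question.

```python
import struct

def f32_spacing(x: float) -> float:
    """Gap between x (rounded to float32) and the next float32 above it.

    Only meant for positive, finite x.
    """
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    here = struct.unpack("<f", struct.pack("<I", bits))[0]
    # Adding 1 to the bit pattern gives the next representable float32.
    above = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return above - here

# Hypothetical result magnitudes, chosen only for illustration.
for magnitude in (1.0, 8.0, 100.0):
    gap = f32_spacing(magnitude)
    print(f"value {magnitude:>6}: spacing {gap:.3e}, relative {gap / magnitude:.3e}")
```

For results somewhere around 8 to 16, a single step between neighbouring float32 values is already about 1e-6, so two computations that round differently by one step would show exactly this kind of difference.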
Dividing by a scalar uses a different kernel than dividing by a CUDA tensor. This is speculation, but one difference between the two is that in the scalar case the 100 is passed around as fp64 before being converted to fp32 on the GPU, which may be what produces a small rounding difference.
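As a sketch of how two mathematically equivalent code paths can round differently in fp32, the snippet below compares a true division by 100 with a multiplication by a pre-rounded fp32 reciprocal. This is an assumption made purely for illustration: I am not claiming that either PyTorch kernel works this way. The helpers `to_f32`, `div_f32`, and `mul_recip_f32` are hypothetical names, and float32 rounding is emulated with `struct` so the example runs without a GPU.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def div_f32(x: float, d: float) -> float:
    """float32 division: divide, then round the result to float32."""
    return to_f32(to_f32(x) / to_f32(d))

def mul_recip_f32(x: float, d: float) -> float:
    """Alternative path: round 1/d to float32 first, then multiply."""
    recip = to_f32(1.0 / to_f32(d))
    return to_f32(to_f32(x) * recip)

# 100 itself is exactly representable in float32, but 1/100 is not.
inputs = [to_f32(i * 0.37) for i in range(1, 1001)]  # arbitrary sample values
diffs = [abs(div_f32(x, 100.0) - mul_recip_f32(x, 100.0)) for x in inputs]

mismatches = sum(1 for d in diffs if d != 0.0)
print(f"{mismatches} of {len(inputs)} results differ, largest gap {max(diffs):.3e}")
```

Any differences stay within a few float32 rounding steps of the result, which is the same order of effect as the ~1e-6 discrepancy discussed above.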