That’s right. For CPU tensors it uses the NumPy implementation, which might be a bit slower.
However, the gap between CPU and GPU tensors in the current torch implementation is quite large, so I would like to check whether something changed internally, since the code was timed before being released.
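If you want to re-time it on your side, something like the following might help. It is only a rough sketch, not the script used for the original timings: `time_fn`, its parameters, and the `workload` stand-in are hypothetical names I made up for illustration, and you would swap in the actual function and tensors.

```python
import statistics
import time


def time_fn(fn, *args, warmup=3, repeats=10, sync=None):
    """Return the median wall-clock time in seconds of one fn(*args) call.

    `sync` is an optional zero-argument callable that blocks until any
    queued asynchronous work has finished.
    """
    # Warm-up runs keep one-time setup costs out of the measurement.
    for _ in range(warmup):
        fn(*args)
    if sync is not None:
        sync()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        if sync is not None:
            sync()  # wait for pending work before stopping the clock
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload (an assumption); replace with the function being compared.
def workload(n):
    return sum(i * i for i in range(n))


elapsed = time_fn(workload, 10_000)
print(f"median time per call: {elapsed * 1e6:.1f} us")
```

For GPU tensors you would pass `sync=torch.cuda.synchronize`, since CUDA kernels are launched asynchronously and the timer would otherwise stop before the work is done. For CPU tensors `sync` can stay `None`.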