I don’t know what the failure conditions of diagonal() are; maybe there are none.
To your second question: it is operation dependent, but generally either contiguous() is called under the hood, or tensor iterators handle the read pointers, or CUDA kernels do the address arithmetic themselves (and perform worse if scattered memory regions must be read). IIRC, torch’s operations very rarely refuse to work outright because of non-contiguity.
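For illustration, here is a minimal sketch (names and sizes are my own) showing that common view-producing ops like diagonal() and t() return non-contiguous tensors, and that elementwise ops and matmul still accept them:

```python
import torch

x = torch.arange(16.0).reshape(4, 4)

# diagonal() returns a strided view into the same storage, not a copy
d = x.diagonal()
print(d.is_contiguous(), d.stride())   # False (5,) -- steps over 5 elements

# t() is another common source of non-contiguous views
t = x.t()
print(t.is_contiguous(), t.stride())   # False (1, 4)

# ops still accept them; the strides are walked internally
print((d * 2).sum())   # elementwise op on the diagonal view, no explicit copy
print(t @ x)           # matmul on the transposed view also works
```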
In most cases, I wouldn’t bother with explicit contiguous() copies, unless the non-contiguous tensor is reused multiple times, it feeds into a heavy operation like matmul, or its stride() is ugly and the next operation spends a lot of extra time (per the profiler) because of it.
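If you want to check whether an explicit copy pays off in a particular case, a crude wall-clock comparison like the sketch below (my own names; for real analysis use torch.profiler) is usually enough. Note the gap can be small or absent, e.g. BLAS matmul handles transposed operands natively:

```python
import time
import torch

a = torch.randn(2048, 2048)
b = torch.randn(2048, 2048).t()        # non-contiguous view of a random matrix

def bench(fn, iters=10):
    # crude wall-clock average; prefer torch.profiler for serious work
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

t_view = bench(lambda: a @ b)          # matmul reading through the view
bc = b.contiguous()                    # one-time explicit copy
t_copy = bench(lambda: a @ bc)         # matmul on the contiguous copy

print(f"non-contiguous operand: {t_view * 1e3:.2f} ms")
print(f"contiguous copy:        {t_copy * 1e3:.2f} ms")
```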