Lu_unpack source code

Is lu_unpack (src) as efficient as possible? For instance, let's have a look at this part:

```python
P = torch.eye(sz, device=LU_data.device, dtype=LU_data.dtype)
final_order = list(range(sz))
for k, j in enumerate(LU_pivots_zero_idx):
    final_order[k], final_order[j] = final_order[j], final_order[k]
P = P.index_select(1, torch.as_tensor(final_order, device=LU_pivots.device))
```

Why isn’t final_order allocated on LU_pivots.device from the start? Doesn’t indexing with `j` force a sync if LU_pivots is on a GPU?
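Here is a minimal sketch of the sync point I am asking about. `pivots` is my stand-in for `LU_pivots_zero_idx`; the code falls back to CPU when no GPU is available:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
pivots = torch.tensor([2, 0, 1], device=device)  # stand-in for LU_pivots_zero_idx

order = list(range(3))
for k, j in enumerate(pivots):
    # `j` is a 0-d tensor; using it as a list index calls its __index__(),
    # which copies the value to the host -- a sync point on a CUDA device.
    order[k], order[j] = order[j], order[k]
```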

I can’t edit the text of my question anymore. Maybe I sounded too critical. I was just wondering whether there was a particular reason the author chose to use a Python list for final_order instead of a tensor allocated on the same device as LU_pivots.
It’s my understanding that most PyTorch operations (at least on CUDA devices) are asynchronous, that is, the call returns right away and doesn’t wait for the operation to complete. If we use a computed value on the Python side, though, the Python code blocks and waits for the value to be available before proceeding to the next statement.
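A toy illustration of what I mean (deterministic values, CPU fallback when no GPU is available):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.ones(4, 4, device=device)

b = a @ a         # on a CUDA device this is queued and returns immediately
c = b.sum()       # still queued; nothing has been read back yet
value = c.item()  # .item() needs the number on the host, so Python blocks here
# value == 64.0: each entry of a @ a is 4.0, summed over 16 entries
```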
So I think that if final_order were a PyTorch tensor, the whole snippet above could execute asynchronously without stalling the Python side. Also, the last line wouldn’t need to transfer data between devices.
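Something along these lines is what I had in mind. This is only a sketch, not a proposed patch; `make_permutation` and its arguments are my own names, not anything from the PyTorch source:

```python
import torch

def make_permutation(sz, pivots, device):
    # Sketch: keep final_order on `device` so the swaps and the later
    # index_select involve no host/device transfer. (On CUDA the loop
    # still launches one small kernel per swap, so whether this is
    # actually faster would need benchmarking.)
    final_order = torch.arange(sz, device=device)
    for k in range(pivots.numel()):
        j = pivots[k]
        # clone first so the two writes don't read through stale views
        fk, fj = final_order[k].clone(), final_order[j].clone()
        final_order[k], final_order[j] = fj, fk
    return final_order
```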