Is lu_unpack (src) as efficient as possible? For instance let’s have a look at this part:
```python
P = torch.eye(sz, device=LU_data.device, dtype=LU_data.dtype)
final_order = list(range(sz))
for k, j in enumerate(LU_pivots_zero_idx):
    final_order[k], final_order[j] = final_order[j], final_order[k]
P = P.index_select(1, torch.as_tensor(final_order, device=LU_pivots.device))
```
Why isn’t `final_order` allocated on `LU_pivots.device` from the start? Doesn’t `final_order[j]` force a sync if `LU_pivots` is on a GPU?
I can’t edit the text of my question anymore. Maybe I sounded too critical. I was just wondering whether there was a particular reason the author chose to use a Python list for `final_order` instead of a tensor allocated on the same device as `LU_pivots`.
It’s my understanding that most(?) PyTorch operations are asynchronous, that is, the call returns right away and doesn’t wait for the operation to complete. If we read the computed value on the Python side, though, the Python code stops and waits for the value to be available before proceeding to the next statement.
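For concreteness, here is a small illustration of that synchronization point (the variable names are mine, not from the original code):

```python
import torch

# On a CUDA device, tensor ops are queued asynchronously, but pulling a
# value back into Python blocks until the computation has finished.
x = torch.randn(100, 100)
y = x @ x            # asynchronous on CUDA: returns as soon as it is queued
v = y[0, 0].item()   # host read: implicitly synchronizes with the device
```

Indexing a Python list with a 0-dim GPU tensor, as `final_order[j]` does, triggers the same kind of host read for every pivot.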
So I think that if `final_order` were a PyTorch tensor, the whole piece of code above would execute asynchronously without any slowdown. Also, the last line wouldn’t need to transfer data between devices.
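A minimal sketch of what I have in mind (this is my hypothetical rewrite, not the actual PyTorch implementation; `make_P` and its signature are made up):

```python
import torch

def make_P(LU_pivots, sz, dtype=torch.float32):
    # Keep final_order on LU_pivots.device so that no pivot value is ever
    # read back into Python, avoiding the implicit device sync.
    dev = LU_pivots.device
    pivots_zero_idx = LU_pivots.to(torch.long) - 1  # LAPACK pivots are 1-based
    ks = torch.arange(sz, device=dev)               # device-side loop indices
    final_order = torch.arange(sz, device=dev)
    for k in range(sz):
        idx = torch.stack([ks[k], pivots_zero_idx[k]])
        # Swap final_order[k] and final_order[j] without calling .item():
        # the right-hand side is gathered before the scatter, so this is
        # a valid swap even when k == j.
        final_order[idx] = final_order[idx.flip(0)]
    return torch.eye(sz, device=dev, dtype=dtype).index_select(1, final_order)
```

On CPU this buys nothing, and even on CUDA the loop still launches a few kernels per pivot, so whether it is faster in practice would depend on kernel-launch overhead versus the cost of the per-pivot syncs.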