Is `lu_unpack` (src) as efficient as possible? For instance, let's have a look at this part:

```python
P = torch.eye(sz, device=LU_data.device, dtype=LU_data.dtype)
final_order = list(range(sz))
for k, j in enumerate(LU_pivots_zero_idx):
    final_order[k], final_order[j] = final_order[j], final_order[k]
P = P.index_select(1, torch.as_tensor(final_order, device=LU_pivots.device))
```
Shouldn't `final_order` be allocated on `LU_pivots.device` from the start? Doesn't `final_order[j]` force a sync if `LU_pivots` is on a GPU?
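To make the concern concrete, here is a minimal CPU-runnable sketch (the tensor values are made up for illustration): iterating over a tensor yields 0-d tensors, and using one as a Python list index calls `__index__()`, which has to materialize the value on the host and would be a synchronization point on CUDA.

```python
import torch

# Stand-in for LU_pivots_zero_idx; imagine this tensor lives on a GPU.
pivots = torch.tensor([1, 0])
final_order = [0, 1]
for k, j in enumerate(pivots):
    # `j` is a 0-d tensor; list indexing converts it to a Python int
    # right here, which on CUDA would block until the value is ready.
    final_order[k], final_order[j] = final_order[j], final_order[k]
```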
I can't edit the text of my question anymore. Maybe I sounded too critical. I was just wondering whether there was a particular reason the author chose to use a Python list for `final_order` instead of a tensor allocated on the same device as `LU_pivots`.
It's my understanding that most(?) PyTorch operations are asynchronous, that is, the call returns right away and doesn't wait for the operation to complete. If we read the computed value on the Python side, though, the Python code stops and waits for the value to be available before proceeding to the next statement.
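The distinction can be shown with a small CPU-runnable sketch (values are illustrative): reading a single element into Python forces a device-to-host copy, while indexing with a tensor stays a pure tensor operation that can be queued asynchronously.

```python
import torch

# Stand-in for LU_pivots_zero_idx; imagine it lives on a GPU.
pivots = torch.tensor([2, 0, 1])

# Python-side read: int(...) must have the actual value, so on CUDA it
# would block until the kernel producing `pivots` has finished.
j = int(pivots[0])

# Tensor-side alternative: indexing with a tensor is itself a tensor op,
# so it can be enqueued without a host round-trip.
order = torch.arange(3)
order_permuted = order[pivots]
```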
So I think that if `final_order` were a PyTorch tensor, the whole piece of code above would execute asynchronously without any slowdown. Also, the last line wouldn't need to transfer data between devices.
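A minimal sketch of that idea (`make_permutation` and its argument name are mine, not PyTorch's internals): keep the same sequential pivot swaps, but hold `final_order` as a tensor on the pivots' device and do each swap with tensor indexing, so no element is ever read back into Python.

```python
import torch

def make_permutation(pivots_zero_idx: torch.Tensor) -> torch.Tensor:
    # Assumes pivots_zero_idx holds 0-based pivot targets, as in the
    # lu_unpack snippet quoted above.
    sz = pivots_zero_idx.numel()
    final_order = torch.arange(sz, device=pivots_zero_idx.device)
    for k in range(sz):
        j = pivots_zero_idx[k]  # 0-d tensor, no host read
        # Swap final_order[k] and final_order[j] entirely on-device:
        # advanced indexing copies the RHS before the assignment.
        idx = torch.stack((torch.tensor(k, device=j.device), j))
        final_order[idx] = final_order[idx.flip(0)]
    return final_order
```

The loop itself still runs in Python, but every statement only enqueues tensor work, so on CUDA nothing here would have to wait for the GPU.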