The target tensor is X with shape (B, H, W, C), the index tensor is I with shape (B, N, 2), and the source tensor is S with shape (B, N, C).

How can I implement X[i, I[i,j,0], I[i,j,1]] = S[i, j] for all i and j efficiently and in parallel?

I tried using a `for` loop to assign into X, but it took too much time in `loss.backward()`.
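To make the intended operation concrete, here is a minimal sketch, assuming PyTorch (inferred from `loss.backward()`) and an integer (`torch.long`) index tensor. The function names `scatter_loop` and `scatter_indexed` are hypothetical, chosen only for illustration. The first spells out the per-element assignment as a loop; the second is one possible vectorized form using advanced indexing with a broadcast batch index.

```python
import torch

def scatter_loop(X, I, S):
    # Reference semantics: one assignment per (i, j) pair.
    out = X.clone()
    B, N, _ = I.shape
    for i in range(B):
        for j in range(N):
            out[i, I[i, j, 0], I[i, j, 1]] = S[i, j]
    return out

def scatter_indexed(X, I, S):
    # Assumption: I has dtype torch.long and holds valid (row, col) pairs.
    B, N, _ = I.shape
    # Batch index of shape (B, N): row i is filled with the value i.
    b = torch.arange(B, device=I.device).unsqueeze(1).expand(B, N)
    out = X.clone()  # keep the original X untouched
    # Three (B, N) index tensors select a (B, N, C) slice, matching S.
    out[b, I[..., 0], I[..., 1]] = S
    return out
```

If the same (row, col) pair appears more than once within a batch element, which source row ends up in the result is not guaranteed in the vectorized version, so the two functions are only expected to agree when the indices are unique per batch element.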