I have a large 1D tensor `A` containing around 20M elements. I also have some spans of unequal lengths, i.e., `B = [(s_1, e_1), (s_2, e_2), ..., (s_n, e_n)]`, where `n` may be more than 8K. A one-time slice `A[s:e]` is very fast, but slicing all spans in `B` with a `for` loop is very time-consuming. Is there any way to slice in parallel on the GPU? My torch version is 1.8.1, and some operations like `map_()` and `apply_()` are only available on the CPU.

For example:

```
A = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
B = torch.tensor([[0, 3], [6, 8]])
C = UnknownFunction(A, B)
```

`C` should also be a 1D tensor: `[1, 2, 3, 4, 7, 8, 9]` (note that the span ends are treated as inclusive here).
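One possible fully vectorized sketch, assuming the inclusive end indices implied by the expected output above (the helper name `slice_spans` is mine, not a PyTorch API): build the flat gather indices for all spans at once with `torch.repeat_interleave`, `cumsum`, and `arange`, then do a single fancy-indexing read. All ops here run on the GPU and exist in torch 1.8.1.

```python
import torch

def slice_spans(A, B):
    """Gather A over all spans in B at once.

    A: 1D tensor; B: (n, 2) tensor of [start, end] pairs with inclusive ends
    (matching the example output above). Returns the concatenation of all spans.
    """
    starts, ends = B[:, 0], B[:, 1]
    lengths = ends - starts + 1          # inclusive end -> +1
    total = lengths.sum()

    # Running position 0..total-1 across the concatenated output.
    pos = torch.arange(total, device=A.device)

    # For each output position, which span it belongs to, and where
    # that span begins in the concatenated output.
    span_id = torch.repeat_interleave(
        torch.arange(len(B), device=A.device), lengths)
    span_offsets = torch.cat([
        torch.zeros(1, dtype=lengths.dtype, device=A.device),
        lengths.cumsum(0)[:-1]])

    # Within-span offset plus the span's start gives the index into A.
    idx = pos - span_offsets[span_id] + starts[span_id]
    return A[idx]

A = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
B = torch.tensor([[0, 3], [6, 8]])
C = slice_spans(A, B)  # tensor([1, 2, 3, 4, 7, 8, 9])
```

If your spans actually use Python's exclusive-end convention (`A[s:e]`), drop the `+ 1` when computing `lengths`. This builds one index tensor of size `sum(lengths)` and performs a single gather, so the per-span Python loop disappears entirely.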

Thanks for your kind help in advance!