How to slice multiple spans from a big 1D tensor in parallel on the GPU?

I have a big 1D tensor A, which contains around 20M elements. I also have some spans of unequal lengths, i.e., B=[(s_1, e_1), (s_2, e_2), ..., (s_n, e_n)], where n may be more than 8K. A single slice A[s:e] is very fast, but slicing all the spans in B with a for loop is very time-consuming. Is there any way to do the slicing in parallel on the GPU? My torch version is 1.8.1, and some operations like map_() and apply_() are only available on the CPU.

For example:

A = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
B = torch.tensor([[0, 3], [6, 8]])
C = UnknownFunction(A, B)

C is also a 1D tensor: tensor([1, 2, 3, 4, 7, 8, 9]). (Note that the end index of each span is inclusive here, i.e., each span corresponds to A[s:e+1].)
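For reference, here is one possible fully vectorized sketch (not a confirmed solution) that avoids the Python loop by building a single flat gather index with repeat_interleave, then doing one fancy-indexing gather. It assumes the end indices are inclusive, matching the expected output above; on GPU the same code works with A and B on a CUDA device.

```python
import torch

A = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
B = torch.tensor([[0, 3], [6, 8]])  # end indices inclusive, per the example

lengths = B[:, 1] - B[:, 0] + 1                # per-span lengths: [4, 3]
starts = B[:, 0].repeat_interleave(lengths)    # [0, 0, 0, 0, 6, 6, 6]
# offset of each output element within its own span: [0, 1, 2, 3, 0, 1, 2]
span_begins = lengths.cumsum(0) - lengths      # start position of each span in C
offsets = torch.arange(lengths.sum()) - span_begins.repeat_interleave(lengths)
idx = starts + offsets                         # flat gather index: [0, 1, 2, 3, 6, 7, 8]
C = A[idx]
print(C.tolist())                              # [1, 2, 3, 4, 7, 8, 9]
```

All operations used here (repeat_interleave, cumsum, fancy indexing) run as batched GPU kernels, so the whole gather is a handful of kernel launches regardless of how many spans B contains.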

Thanks for your kind help in advance!