GPU masked-assignment operation seems unreasonably slow?

I have the following operation in my code:

result[mask] = aa_select[mask]

where everything is already living on the GPU:

result
tensor([[ 0, 20, 32,  ..., 18, 14,  2],
        [ 0, 20, 19,  ..., 18, 14,  2],
        [ 0, 20, 19,  ..., 18, 14,  2],
        ...,
        [ 0, 20, 19,  ..., 18, 14,  2],
        [ 0, 32, 19,  ..., 18, 14,  2],
        [ 0, 20, 19,  ..., 18, 32,  2]], device='cuda:0')
mask
tensor([[False, False,  True,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        ...,
        [False, False, False,  ..., False, False, False],
        [False,  True, False,  ..., False, False, False],
        [False, False, False,  ..., False,  True, False]], device='cuda:0')
aa_select
tensor([[ 0, 20, 15,  ..., 18, 14,  2],
        [ 0, 20, 19,  ..., 18, 14,  2],
        [ 0, 20, 19,  ..., 18, 14,  2],
        ...,
        [ 0, 20, 15,  ..., 18, 14,  2],
        [ 0, 20, 19,  ..., 18, 14,  2],
        [ 0, 20, 19,  ..., 18, 15,  2]], device='cuda:0')

Now the shapes of these tensors are (100, 244), so they are not unreasonably big.
However, the operation at the top takes ~3.5 seconds to complete, which seems completely unreasonable.

Does anyone have any clue what is going on here? My guess is that the operation is somehow not supported on the GPU, so it falls back to the CPU, which could explain how it takes this long. (For comparison: I got `result` by running data through my neural network with 650M parameters, and that forward pass took 0.04 s, which highlights how absurdly expensive the 3.5 seconds are.)
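For reference, here is a minimal self-contained script approximating my measurement. This is a sketch: the shapes and value range are taken from the printouts above, the mask density is a guess, and I added a warm-up run plus `torch.cuda.synchronize()` around the timed line so that pending asynchronous kernel launches don't get billed to it.

```python
import time
import torch

# Fall back to CPU if no GPU is available, so the script still runs anywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Shapes and value range match the printouts above; mask density is a guess.
result = torch.randint(0, 33, (100, 244), device=device)
aa_select = torch.randint(0, 33, (100, 244), device=device)
mask = torch.rand(100, 244, device=device) < 0.05

# Warm-up run, then sync so earlier queued work doesn't pollute the timing.
result[mask] = aa_select[mask]
if device == "cuda":
    torch.cuda.synchronize()

start = time.perf_counter()
result[mask] = aa_select[mask]
if device == "cuda":
    torch.cuda.synchronize()  # wait for the kernel to actually finish
elapsed = time.perf_counter() - start

print(f"masked assignment took {elapsed:.6f} s on {device}")
```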

How can I fix this?