Fast masking/row-selection without copying

My training loop looks like the following:

for X, mask in dataloader(...):
    X = X[mask]  # only a small subset (of rows) of X is good for training, very slow
    X_cuda = X.cuda(non_blocking=True)
    prediction = model(X_cuda)

My X is very large. The masking/row-selection takes a lot of time in each iteration (because it copies the data instead of sharing storage), so my GPU is underutilized.

Is there a way to do X[mask] that avoids data copying?

I don’t think that’s possible, as the number of output elements depends on the mask and is thus unknown before the actual values are available. I.e. you won’t be able to preallocate a tensor with a specific shape unless you waste memory and allocate the max. number of elements (X.nelement()).
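A minimal sketch of why boolean masking must copy: the result's shape depends on the mask's contents, and the selected rows own fresh storage (checked here via data_ptr()):

```python
import torch

X = torch.arange(12.0).reshape(4, 3)
mask = torch.tensor([True, False, True, False])

# The output shape depends on how many entries of `mask` are True,
# so it cannot be known (or preallocated) ahead of time:
selected = X[mask]
print(selected.shape)  # torch.Size([2, 3])

# The result owns new storage, i.e. it is a copy, not a view:
print(selected.data_ptr() == X.data_ptr())  # False
```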

Can X[mask] return a view or something?
The following code returns a view instead of a copy, right?

a, b = get_range(...)
view = X[a: b]

No, I don’t think masking can create a view, since the masked indices can be arbitrary.
Yes, slicing the tensor will create a view, so if your mask follows a specific stride/indexing pattern you might want to convert it to a slicing operation.
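A small sketch of both points: basic slicing shares storage with the original tensor (writing through the view is visible in X), and a mask with a regular stride can be replaced by a strided slice that selects the same rows without copying:

```python
import torch

X = torch.arange(20.0).reshape(10, 2)

# A contiguous range is a basic slice, which is a view (no copy);
# writing through it modifies X:
view = X[2:5]
view[0, 0] = -1.0
print(X[2, 0])  # tensor(-1.)

# If the mask has a regular stride (e.g. every 3rd row), a strided
# slice is also a view and selects the same rows as X[mask]:
mask = torch.zeros(10, dtype=torch.bool)
mask[::3] = True
strided_view = X[::3]
print(torch.equal(strided_view, X[mask]))  # True
```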