Performance difference: mask as index vs. multiplication with the mask

Hey,

I'm curious whether there is a performance difference between these two:

mask = ...
target = ...
pred = ...
loss = torch.sum(mask * (target - pred) ** 2)
loss.backward()

and

mask = ...
target = ...
pred = ...
loss = torch.sum((target[mask] - pred[mask]) ** 2)
loss.backward()

The code would mostly be running on the GPU.

So, have you tried?

import torch

# 0/1 float mask for the multiplication variant, plus its boolean counterpart
# for indexing, so that both variants compute the same loss
mask = (torch.randn(100, 100, 100, device='cuda') > 0).float()
mask2 = mask > 0

target = torch.randn(100,100,100, requires_grad=True, device='cuda')
pred = torch.randn(100,100,100, requires_grad=True, device='cuda')

def fn1():
    # dense variant: multiply the squared difference by the float mask
    loss = torch.sum(mask * (target - pred) ** 2)
    loss.backward()
    torch.cuda.synchronize()

def fn2():
    # indexing variant: gather the masked elements into new tensors first
    loss = torch.sum((target[mask2] - pred[mask2]) ** 2)
    loss.backward()
    torch.cuda.synchronize()

torch.cuda.synchronize()

%timeit fn1()
%timeit fn2()

For me, the first variant is ~4 times as fast. That isn't terribly surprising, considering that multiplying dense tensors is much cheaper than gathering the mask-selected items into a new tensor. (An intermediate case would be to take the difference on the full tensors and apply the mask only once; see the sketch below.)
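In code, that intermediate case would look roughly like this (reusing mask2 from the snippet above; the indexing still allocates a new tensor, but only once):

loss = torch.sum(((target - pred)[mask2]) ** 2)
loss.backward()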
The first variant will be even faster when you get the JIT to fuse the two multiplications.
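As a rough sketch of what I mean (the function name is arbitrary): wrapping the loss in a scripted function gives the fuser a chance to combine the pointwise ops (subtract, square, multiply by the mask) into a single kernel:

@torch.jit.script
def masked_mse(mask: torch.Tensor, target: torch.Tensor, pred: torch.Tensor) -> torch.Tensor:
    # pointwise ops inside a scripted function are candidates for fusion
    return torch.sum(mask * (target - pred) ** 2)

loss = masked_mse(mask, target, pred)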
Finally, a bit of a warning: in the presence of NaN or Inf in the "masked away" part, the two will differ, as the first will give NaN. You could avoid that with torch.where, but that isn't free either.
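The torch.where version would look roughly like this (just a sketch; it zeroes the masked-away entries before they reach the square and the sum, at the cost of an extra pointwise op):

diff = target - pred
loss = torch.sum(torch.where(mask2, diff, torch.zeros_like(diff)) ** 2)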

Best regards

Thomas

Thanks for an excellent answer!

My only attempt was without synchronization, as I forgot about the async nature of CUDA, and I ended up moving on.
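For anyone reading this later: synchronizing before reading the timer (as in the snippets above), or recording torch.cuda.Event objects, is what makes GPU timings meaningful. A minimal sketch with events:

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
fn1()
end.record()
torch.cuda.synchronize()
print(start.elapsed_time(end))  # milliseconds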

NaN values should not be an issue in this case.

Thanks as well for the explanation, super helpful.

Once I have more time, I will test the runtime difference in the actual project and post an update in this thread.