Running time and BP under condition

I want to train a network but apply the gradient only to those images (out of a minibatch) whose output satisfies some condition - for example, the maximal value of the output for a given image is above some threshold.
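A minimal sketch of what I mean (the model, loss, and threshold here are just stand-ins, not my actual setup):

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)              # stand-in model (assumption)
criterion = nn.CrossEntropyLoss()       # stand-in loss (assumption)
images = torch.randn(100, 784)
labels = torch.randint(0, 10, (100,))

output = model(images)
# Condition: keep only images whose maximal output exceeds a threshold.
mask = output.max(dim=1).values > 0.5
if mask.any():
    loss = criterion(output[mask], labels[mask])
    loss.backward()
```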

Before trying to solve this, I fed forward a minibatch of 100 images and measured the time it took to compute the forward and backward passes: about 25 seconds, while for 10 images it took about 7 seconds.

But when I tried the following (taking only 10 outputs out of the 100), it still took 25 seconds:

output = model(images)

output = output[0:10, :]
labels = labels[0:10, :]

loss = criterion(output, labels)


How are you doing the timing here (e.g., with appropriate torch.cuda.synchronize() calls before the timing starts and stops, if the execution is happening on the GPU)?
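For reference, a sketch of the timing pattern I mean (the helper name and iteration count are arbitrary): CUDA kernels launch asynchronously, so without synchronization you may only be measuring launch overhead.

```python
import time
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

def timed(fn, iters=1000):
    if device == 'cuda':
        torch.cuda.synchronize()  # drain pending work before starting the clock
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == 'cuda':
        torch.cuda.synchronize()  # wait for all launched kernels to finish
    return (time.perf_counter() - t0) / iters

x = torch.randn(100, 10, device=device)
per_iter = timed(lambda: torch.matmul(x.t(), x))
```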

Note that it could also be expected that the time does not change substantially if the bottleneck lies elsewhere or the hardware is not being fully utilized.

I use time.time() before and after the backprop part.
I don’t think the bottleneck lies somewhere else.

I think something in the backprop does a matrix multiplication of size [100, 10] instead of [10, 10].

Maybe some placeholder is created after the forward pass?

These problem sizes are small enough that they are essentially “free” compared to the cost of launching the kernels and dispatching from Python.
For example, on an A6000, we need around 1e5 iterations to even get a stable measurement of the time per iteration:

# cat
import time
import torch

iters = 100000

a = torch.randn(10, 10, device='cuda')
b = torch.randn(100, 10, device='cuda')

torch.cuda.synchronize()  # make sure setup work has finished before timing
t1 = time.perf_counter()
for _ in range(iters):
  torch.matmul(a, a)
torch.cuda.synchronize()  # wait for the launched kernels to complete
t2 = time.perf_counter()

print(f"10,10 x 10,10 took {t2-t1}, {(t2-t1)/iters} per iter")

torch.cuda.synchronize()
t1 = time.perf_counter()
for _ in range(iters):
  torch.matmul(b, a)
torch.cuda.synchronize()
t2 = time.perf_counter()

print(f"100,10 x 10,10 took {t2-t1}, {(t2-t1)/iters} per iter")
# python
10,10 x 10,10 took 0.6885399222373962, 6.885399222373963e-06 per iter
100,10 x 10,10 took 0.6861815741285682, 6.861815741285682e-06 per iter

I would check if you see the same behavior on a larger model or with a greater difference in input sizes.

I don’t think my problem relates to CUDA runtimes at all.

Let me try to define my problem differently:

assume I want to do backprop only on 10 images out of a 100-image minibatch. How can I do it efficiently?
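One possible approach (just a sketch under assumed model/loss, not something from this thread): do a cheap selection pass under torch.no_grad() to find the qualifying images, then run the real forward/backward only on that subset, so autograd never builds a graph over the full batch. The cost is a second forward pass on the selected images.

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)              # stand-in model (assumption)
criterion = nn.CrossEntropyLoss()       # stand-in loss (assumption)
images = torch.randn(100, 784)
labels = torch.randint(0, 10, (100,))

with torch.no_grad():                   # selection pass: no graph is recorded
    idx = (model(images).max(dim=1).values > 0.5).nonzero(as_tuple=True)[0]

if idx.numel() > 0:
    subset_out = model(images[idx])     # graph covers only the selected images
    loss = criterion(subset_out, labels[idx])
    loss.backward()
```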

Does anyone have any suggestions?