Optimizing speed of specific CNN (repetitive procedure)

I am currently implementing Monte Carlo sampling in order to calculate not only the output of the network but also its variance. To do so, the CNN has to be run with dropout enabled on the same input image about 20 times (which is, of course, 20 times slower than a single forward pass). These outputs are then processed to obtain the mean and variance of the prediction. Right now the idea works, but it takes 0.5 s/image with the following execution:

outputs = [net(inputs) for i in range(20)]
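For context, here is a minimal sketch of how those 20 stochastic outputs can be reduced to a predictive mean and variance. The tiny `net` below is an assumption purely for illustration; the important parts are keeping the model in `train()` mode so dropout stays active, and stacking the outputs before reducing:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the real CNN (assumption for illustration only)
net = nn.Sequential(nn.Linear(8, 4), nn.Dropout(p=0.5), nn.Linear(4, 2))
net.train()  # keep dropout active for MC sampling

inputs = torch.randn(1, 8)
with torch.no_grad():
    outputs = [net(inputs) for _ in range(20)]

stacked = torch.stack(outputs)  # [20, 1, 2]
mean = stacked.mean(dim=0)      # predictive mean, [1, 2]
var = stacked.var(dim=0)        # predictive variance, [1, 2]
```

Because each forward pass draws a fresh dropout mask, the variance over the 20 samples is non-zero and serves as the uncertainty estimate.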

Is there a faster way to execute this? I thought about building batches with the same image repeated 20 times, but I guessed that the dropout mask would be the same for the whole batch, giving 20 identical outputs per batch.
Maybe anything related to parallelism?

nn.Dropout will apply a mask to each sample separately, as seen here:

import torch
import torch.nn as nn

drop = nn.Dropout()
x = torch.ones(20, 2)
out = drop(x)  # each row gets its own dropout mask
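To confirm that masks are drawn per sample rather than once per batch, you can check that identical rows in the input produce different dropout patterns in the output (a small check using the default p=0.5):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout()  # p=0.5 by default
x = torch.ones(20, 2)
out = drop(x)

# Identical inputs within one batch are masked independently,
# so the 20 rows of `out` are not all the same pattern.
n_distinct_rows = out.unique(dim=0).shape[0]
print(n_distinct_rows > 1)
```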

However, if you are using a batched input, the batchnorm statistics would be different (in case you are using batchnorm layers).

Oh, that information was very useful; I didn’t know how to check that at all.
So, supposing that I am obtaining the samples one by one with an iterator:

for i in range(loops):
    try:
        inputs, targets = next(dataloader_iterator)
    except StopIteration:
        dataloader_iterator = iter(dataloader)
        inputs, targets = next(dataloader_iterator)

    inputs, targets = inputs.to(device), targets.to(device)

Given a single sample per iteration [targets(i) and inputs(i)], what is the fastest way to build a batch with 20 copies of inputs(i) and 20 copies of targets(i)? This is my last step, but I haven’t been able to do it efficiently.
Thank you very much

If you use batch_size=20 in your DataLoader, the created batches should already contain 20 samples.

Please let me know, if I misunderstood the question.

Sorry, I didn’t explain it well then.
In a real-world situation, each frame will be recorded by a camera. For processing, I would like to create a batch containing N copies of this same frame in order to feed the entire block into the network (with dropout enabled) and take advantage of the parallelism of the CUDA API:

outputs = net(batch_frame)

I have run some tests, and the batched version above seems to be faster than the original approach (below):

outputs = [net(frame) for i in range(N)]

in which the frame is fed through the network N times (but this time not in parallel).
The problem is that I don’t know how to create a batch with N copies of the same image.

Assuming that each image tensor has the shape [channels, height, width], you could use expand or repeat to create a tensor with the shape [20, channels, height, width]:

import torch

x = torch.randn(3, 224, 224)
x = x.unsqueeze(0)  # [1, 3, 224, 224]

x1 = x.expand(20, -1, -1, -1)
print(x1.shape)  # torch.Size([20, 3, 224, 224])

# or
x2 = x.repeat(20, 1, 1, 1)
print(x2.shape)  # torch.Size([20, 3, 224, 224])

The former would only manipulate the metadata (stride and shape), while the latter would trigger a copy.
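Putting the pieces together, the whole MC-dropout estimate then becomes a single batched forward pass. This is a sketch under assumptions: the small `net` below stands in for the real CNN, and `expand` is safe here as long as the network does not modify its input in place:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the real CNN (assumption for illustration only)
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 8 * 8, 4),
    nn.Dropout(p=0.5),
    nn.Linear(4, 2),
)
net.train()  # keep dropout active for MC sampling

frame = torch.randn(3, 8, 8)                       # single camera frame
batch = frame.unsqueeze(0).expand(20, -1, -1, -1)  # [20, 3, 8, 8], no copy

with torch.no_grad():
    outputs = net(batch)  # one parallel forward pass, [20, 2]

mean = outputs.mean(dim=0)  # predictive mean, [2]
var = outputs.var(dim=0)    # predictive variance, [2]
```

Since dropout masks each batch element independently, the 20 rows of `outputs` are distinct samples, and the reduction over dim 0 gives the same mean/variance estimate as the 20-loop version, but in one GPU launch.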
