Libtorch spends a lot of time when the input size changes

Env: C++, CUDA 9, self-compiled libtorch from source (1.1.0a0+deadf3b), traced model
Problem: libtorch spends a lot of time reloading the model whenever I change the input size during inference (in fact, it seems the same in Python). If I don't change the input size, inference is fast after the first, time-consuming pass. What can I do to speed up inference if the input size changes all the time?
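For reference, here is a minimal sketch of a timing loop that exposes the behavior. The model path and the two sizes are placeholders, and the API shown (torch::jit::load returning a Module by value, torch::cuda::synchronize) is the one from recent libtorch releases, so 1.1.0-era code may differ slightly:

#include <torch/script.h>
#include <torch/cuda.h>
#include <chrono>
#include <iostream>
#include <vector>

int main() {
  // Hypothetical traced model; replace "model.pt" with your own file.
  torch::jit::script::Module module = torch::jit::load("model.pt");
  module.to(torch::kCUDA);

  // Two different input sizes: each new size pays the setup cost again.
  std::vector<std::vector<int64_t>> sizes = {{1, 3, 512, 960},
                                             {1, 3, 1440, 1000}};
  for (const auto& s : sizes) {
    auto input = torch::randn(s, torch::kCUDA);
    auto t0 = std::chrono::steady_clock::now();
    module.forward({input});
    torch::cuda::synchronize();  // wait for the GPU before stopping the clock
    auto t1 = std::chrono::steady_clock::now();
    std::cout << "cost: "
              << std::chrono::duration<double>(t1 - t0).count() << " s\n";
  }
}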

What do you mean by changing the input size?

When I do inference, the size of each test input is different. For example, the first one is [1, 3, 512, 960], and the second one may be [1, 3, 1440, 1000]…

I tried c10::cuda::CUDACachingAllocator::emptyCache(); to see whether it would make a difference, but it did not help.
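For anyone searching later, this is roughly where the call goes, as a sketch; it returns cached but unused blocks to the driver, which can lower the reported memory but does not avoid the per-size setup cost:

#include <c10/cuda/CUDACachingAllocator.h>

// Release cached, unused GPU blocks between inferences.
c10::cuda::CUDACachingAllocator::emptyCache();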

I also found that the C++ version occupies more GPU memory than the Python one. I'd like to know why.

Hm, that seems unlikely, as the C++ API is really just a wrapper: the code it interfaces with is the same code the Python API uses. Does the Python equivalent of your code also slow down once you change the input sizes?

Yes, it does! And I found that C++ uses more GPU memory and has higher inference time than Python.

Do you have cuDNN benchmarking mode turned on (torch.backends.cudnn.benchmark = True in Python; I'm not sure about the C++ API)? If so, PyTorch searches for the fastest convolution algorithm for each input size, and that can slow you down.
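In C++ the same switch is exposed through the ATen global context; a minimal sketch:

#include <ATen/Context.h>

// C++ equivalent of torch.backends.cudnn.benchmark = True in Python.
at::globalContext().setBenchmarkCuDNN(true);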

Thanks for your reply. I have tried torch.backends.cudnn.benchmark = True/False in both C++ and Python, and it makes no difference whether the benchmark value is set to True or False.

I also find that once an input image of a given size has been run through the model, later inputs with the same size are inferred quickly (see the warm-up sketch after the log below).

--debug-- input_concat shape: torch.Size([1, 3, 2272, 1280])
cost: 4.270411252975464
--debug-- input_concat shape: torch.Size([1, 3, 2272, 1280])
cost: 0.027519702911376953
--debug-- input_concat shape: torch.Size([1, 3, 2272, 1280])
cost: 0.058194875717163086
--debug-- input_concat shape: torch.Size([1, 3, 2272, 1280])
cost: 0.04706835746765137
--debug-- input_concat shape: torch.Size([1, 3, 2272, 1280])
cost: 0.048264265060424805
--debug-- input_concat shape: torch.Size([1, 3, 2272, 1280])
cost: 0.04921388626098633
--debug-- input_concat shape: torch.Size([1, 3, 1696, 1280])
cost: 3.20564603805542
--debug-- input_concat shape: torch.Size([1, 3, 1712, 1280])
cost: 3.278776168823242
--debug-- input_concat shape: torch.Size([1, 3, 1696, 1280])
cost: 0.03748798370361328
--debug-- input_concat shape: torch.Size([1, 3, 1280, 1696])
cost: 3.295097589492798
--debug-- input_concat shape: torch.Size([1, 3, 1712, 1280])
cost: 0.032279014587402344
--debug-- input_concat shape: torch.Size([1, 3, 1712, 1280])
cost: 0.05348610877990723
--debug-- input_concat shape: torch.Size([1, 3, 1712, 1280])
cost: 0.04692506790161133
--debug-- input_concat shape: torch.Size([1, 3, 1712, 1280])
cost: 0.03901052474975586
--debug-- input_concat shape: torch.Size([1, 3, 2272, 1280])
cost: 0.05834245681762695
--debug-- input_concat shape: torch.Size([1, 3, 1696, 1280])
cost: 0.034003496170043945
--debug-- input_concat shape: torch.Size([1, 3, 1696, 1280])
cost: 0.04744458198547363
--debug-- input_concat shape: torch.Size([1, 3, 1712, 1280])
cost: 0.04431772232055664
--debug-- input_concat shape: torch.Size([1, 3, 2272, 1280])
cost: 0.06284832954406738
--debug-- input_concat shape: torch.Size([1, 3, 1712, 1280])
cost: 0.04384565353393555
--debug-- input_concat shape: torch.Size([1, 3, 1280, 1712])
cost: 2.316293716430664
--debug-- input_concat shape: torch.Size([1, 3, 1712, 1280])
cost: 0.027036666870117188
--debug-- input_concat shape: torch.Size([1, 3, 1712, 1280])
cost: 0.050749778747558594
--debug-- input_concat shape: torch.Size([1, 3, 1280, 1696])
cost: 0.047841787338256836
--debug-- input_concat shape: torch.Size([1, 3, 1280, 1952])
cost: 3.9861865043640137
--debug-- input_concat shape: torch.Size([1, 3, 1712, 1280])
cost: 0.027935028076171875
--debug-- input_concat shape: torch.Size([1, 3, 1280, 1280])
cost: 2.379218101501465
--debug-- input_concat shape: torch.Size([1, 3, 1696, 1280])
cost: 0.02846550941467285
--debug-- input_concat shape: torch.Size([1, 3, 1712, 1280])
cost: 0.04407143592834473
--debug-- input_concat shape: torch.Size([1, 3, 2272, 1280])
cost: 0.06075263023376465
--debug-- input_concat shape: torch.Size([1, 3, 1920, 1280])
cost: 3.509601593017578
--debug-- input_concat shape: torch.Size([1, 3, 1712, 1280])
cost: 0.021055936813354492
--debug-- input_concat shape: torch.Size([1, 3, 1696, 1280])
cost: 0.03338050842285156
--debug-- input_concat shape: torch.Size([1, 3, 1712, 1280])
cost: 0.04442191123962402
--debug-- input_concat shape: torch.Size([1, 3, 1712, 1280])
cost: 0.04460453987121582
--debug-- input_concat shape: torch.Size([1, 3, 1712, 1280])
cost: 0.03949451446533203
--debug-- input_concat shape: torch.Size([1, 3, 1712, 1280])
cost: 0.052010297775268555
--debug-- input_concat shape: torch.Size([1, 3, 1280, 1696])
cost: 0.050638675689697266
--debug-- input_concat shape: torch.Size([1, 3, 1712, 1280])
cost: 0.058187246322631836
--debug-- input_concat shape: torch.Size([1, 3, 1712, 1280])
cost: 0.054537296295166016
avg cost: 0.6914166688919068
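Since the cost is paid once per distinct shape, one workaround is to warm the model up on every size you expect before timing or serving real inputs. A minimal sketch, assuming the set of sizes is known in advance (the sizes below are just the recurring ones from the log above):

#include <torch/torch.h>
#include <torch/script.h>
#include <vector>

void warm_up(torch::jit::script::Module& module,
             const std::vector<std::vector<int64_t>>& sizes) {
  torch::NoGradGuard no_grad;   // inference only, no autograd bookkeeping
  for (const auto& s : sizes) {
    auto dummy = torch::randn(s, torch::kCUDA);
    module.forward({dummy});    // first pass per size pays the setup cost
  }
  torch::cuda::synchronize();
}

// Usage:
// warm_up(module, {{1, 3, 2272, 1280}, {1, 3, 1712, 1280}, {1, 3, 1696, 1280}});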

I have the same problem: libtorch uses much more GPU memory than Python with the same image size. Have you figured out how to deal with it? Thank you.

Actually, I have not figured it out yet.

Putting torch::NoGradGuard no_grad; in the scope is the first solution; I haven't found much else that works yet, though… I would love to be able to easily quantize the model when tracing it for C++.
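For reference, a minimal sketch of scoping that guard around the forward call:

#include <torch/torch.h>
#include <torch/script.h>

torch::Tensor infer(torch::jit::script::Module& module,
                    const torch::Tensor& input) {
  // Disables gradient tracking inside this scope, so autograd
  // buffers are not kept alive on the GPU between calls.
  torch::NoGradGuard no_grad;
  return module.forward({input}).toTensor();
}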