How to control tensor size inside of functional files in C++


When I ran predictions on two separate datasets, the first created at 128x128 and the second at 256x256 resolution, I got the same prediction duration for both. However, this only happens on the GPU. On the CPU, the durations are more reasonable.

I have investigated the source code, but only the Python part, not the C++ part. I saw that the output of each layer of the model has the expected size, so PyTorch doesn’t resize the inputs to some constant size anywhere between the start and the end of a prediction. For this reason, I think the input sizes might be changed on the C++ side, for example during the convolution.

How can I get around this and obtain reasonable (or expected) prediction durations that change in accordance with the input size?

Thanks for your attention, have a good day!

Alternatively, it would be nice to know why I get the same prediction durations for differently sized datasets when using the GPU.

Could you post the code you’ve used to profile the workload, please?

@ptrblck thanks for your reply.

I guess you don’t need code to understand the issue. However, I probably didn’t explain myself clearly.
I have prepared some code which will show you what’s going on. The dataset is the major difference between my project and the code I have prepared for you; the general concept is the same.

You can find it here:


As you can see, we obtain the same prediction duration on the GPU, but not on the CPU. I wonder why the prediction duration on the GPU is the same for different input sizes.

Thanks for the code!
Your profiling is unfortunately wrong: GPU operations are executed asynchronously, so you need to synchronize the device via torch.cuda.synchronize() before starting and before stopping the timers, e.g. here:

    for i in range(loop_count):
        torch.cuda.synchronize()  # wait for pending GPU work before starting the timer
        outset = time.time()

Otherwise you would profile the dispatching, kernel launches, etc. unless you are (accidentally) synchronizing the code already.
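
The timing pattern described above could be sketched as a small helper. This is only an illustration under assumptions: `timed_mean` and its parameters are hypothetical names, and `sync` is a stand-in for `torch.cuda.synchronize` so that the snippet also runs on a machine without a GPU.

```python
import time


def timed_mean(fn, sync=lambda: None, warmup=3, iters=10):
    """Return the mean wall-clock seconds per call of fn().

    sync is called before starting and before stopping the timer. On a GPU
    it would be torch.cuda.synchronize (assumption: fn launches CUDA work).
    """
    for _ in range(warmup):   # warm-up calls are not timed
        fn()
    sync()                    # wait for pending work before starting the timer
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()                    # wait for the timed work before stopping the timer
    return (time.perf_counter() - start) / iters
```

With PyTorch this might be called as `timed_mean(lambda: model(x), sync=torch.cuda.synchronize)`, where `model` and `x` are whatever module and input tensor are being measured.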

Great! I had overlooked that a prediction can run asynchronously and that it needs an explicit synchronization call to wait for it. However, it still doesn’t give the result I expected. Maybe you can check the latest measurements and suggest a different approach, because I expect the calculation for the 128x128 resolution to complete in roughly 1/4 of the time needed for the 256x256 resolution.

Here is the code with your latest suggestion:

> Colab/Same Prediction Durations with Different Input Sizes - with cuda.synchroniz.ipynb

Great! Your first synchronization is still wrong, since you would need to synchronize the code before starting the timer, not afterwards.
In any case, you should be careful with your expectation of seeing a perfect 1/4 of the time for a smaller input image since your overall use case is small and I doubt you are saturating your GPU at all. Thus your workload might already be CPU-limited which you could verify by profiling the code.
The actual kernel times might be significantly faster, but your timeline might just see more idle times in case the CPU isn’t fast enough in scheduling the work.
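
To see why a 4x smaller input need not give a 4x shorter runtime, a toy cost model may help. Everything in it is assumed for illustration: the function name, the kernel count, the per-launch overhead and the per-pixel cost are made-up numbers, not measurements. Each kernel costs a fixed CPU-side launch overhead plus GPU time proportional to the pixel count, and since launches are asynchronous, the slower of the two sides bounds the total.

```python
def estimated_seconds(height, width, n_kernels=50,
                      launch_overhead=10e-6, cost_per_pixel=50e-12):
    """Toy estimate of one forward pass; all default values are hypothetical."""
    # CPU side: time spent dispatching and launching the kernels.
    cpu_side = n_kernels * launch_overhead
    # GPU side: kernel time, assumed to scale linearly with the pixel count.
    gpu_side = n_kernels * height * width * cost_per_pixel
    # The GPU idles while waiting for launches, so the slower side dominates.
    return max(cpu_side, gpu_side)


for size in (128, 256, 2048, 4096):
    print(size, estimated_seconds(size, size))
```

With these made-up numbers, both 128x128 and 256x256 are bound by the launch overhead, so the model predicts the same time for them, while the step from 2048x2048 to 4096x4096 is GPU-bound and scales by about 4x. Real kernels do not scale this cleanly, so a profiler is still needed to tell which regime a given workload is in.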

Understood. Thanks for your hints and for the starting point regarding profiling. I believe I can make more progress with your suggestions after doing some research of my own.
Have a nice day!

Sure! Let me know how it goes or if you get stuck somewhere.