Variable input size effects on gradients

So I’m implementing a 2D pose estimation CNN based on the OpenPose paper. The CNN’s inputs and outputs are images (e.g. image -> confidence maps).

I found it cool that PyTorch can handle variable-sized input images (as long as the sizes within one batch are the same) and output correspondingly sized confidence maps (again, equally sized within one batch). That suits my needs, but what are some potential problems of training such a network? Namely, I’m worried about the effects this might have on gradients. Bigger images mean more convolutions… does that translate into larger gradient steps for bigger images or not?
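For concreteness, here is a minimal sketch I could use to check this empirically (the tiny model, input sizes, and random data are just placeholders, not my actual network). The point it illustrates: with a mean-reduced loss the per-pixel gradients are averaged, so gradient magnitude stays roughly size-independent, whereas a sum-reduced loss scales with the number of output pixels.

```python
import torch
import torch.nn as nn

# Toy fully convolutional model: image -> single-channel "confidence map".
# Purely illustrative; not the OpenPose architecture.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=3, padding=1),
)

def grad_norm(h, w, reduction):
    """Total gradient norm for one random batch of size h x w."""
    model.zero_grad()
    x = torch.randn(4, 3, h, w)       # one batch, all images the same size
    target = torch.randn(4, 1, h, w)  # matching confidence-map target
    loss = nn.MSELoss(reduction=reduction)(model(x), target)
    loss.backward()
    return torch.norm(
        torch.stack([p.grad.norm() for p in model.parameters()])
    ).item()

for size in [(64, 64), (256, 256)]:
    print(size,
          "mean:", grad_norm(*size, "mean"),  # roughly size-independent
          "sum:",  grad_norm(*size, "sum"))   # grows with pixel count
```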

It shouldn’t be a problem at inference time if your network has been properly trained and the architecture isn’t too restrictive. Plenty of architectures accept an open input size.

Consider that, even if the images are bigger, you are applying exactly the same kernel over them. You get bigger outputs, but the convolutions themselves are unaffected. It would be a different discussion if you had fixed-size latent spaces or poolings.
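A quick toy example of that distinction (illustrative layers only, not any particular architecture):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)

# The same kernel works on any spatial size; only the output size changes.
print(conv(torch.randn(1, 3, 64, 64)).shape)    # torch.Size([1, 8, 64, 64])
print(conv(torch.randn(1, 3, 200, 120)).shape)  # torch.Size([1, 8, 200, 120])

# A fixed-size head is what breaks variable input sizes:
head = nn.Sequential(nn.Flatten(), nn.Linear(8 * 64 * 64, 10))
head(conv(torch.randn(1, 3, 64, 64)))      # OK
# head(conv(torch.randn(1, 3, 200, 120)))  # shape mismatch error

# Adaptive pooling restores size independence by fixing the spatial dims:
pooled_head = nn.Sequential(nn.AdaptiveAvgPool2d((4, 4)),
                            nn.Flatten(), nn.Linear(8 * 4 * 4, 10))
print(pooled_head(conv(torch.randn(1, 3, 200, 120))).shape)  # torch.Size([1, 10])
```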