So I'm implementing a 2D pose estimation CNN based on the OpenPose paper. The CNN's inputs and outputs are both images (e.g. image -> confidence maps).
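For context, here's a minimal sketch of the kind of fully convolutional setup I mean (a toy stand-in, not the actual OpenPose architecture; `ToyPoseNet` and the `num_keypoints` default are just placeholders):

```python
import torch.nn as nn

class ToyPoseNet(nn.Module):
    """Toy fully convolutional net: because every layer is a conv,
    any input size (N, 3, H, W) works, and the output confidence
    maps come out at the same spatial resolution (N, K, H, W)."""
    def __init__(self, num_keypoints=18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # 1x1 conv head: one confidence map per keypoint
        self.head = nn.Conv2d(32, num_keypoints, kernel_size=1)

    def forward(self, x):                     # x: (N, 3, H, W)
        return self.head(self.features(x))    # (N, num_keypoints, H, W)
```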
I found it cool that PyTorch can handle variable-sized input images (as long as all images within one batch have the same size) and produce correspondingly sized confidence maps (again, equally sized within one batch). That suits my needs, but what are some potential problems with training such a network? Specifically, I'm worried about the effect this might have on gradients. Bigger images mean more convolution outputs; does that translate into larger gradient steps for bigger images, or not?
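Here's how I was thinking of checking this empirically, reusing the toy model from above (the batch size, input sizes, and dummy targets are just for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = ToyPoseNet()

def grad_norm_for_size(h, w, reduction="mean"):
    """Run one forward/backward pass on a random (h, w) batch
    and return the overall L2 norm of the parameter gradients."""
    model.zero_grad()
    x = torch.randn(2, 3, h, w)        # dummy image batch
    target = torch.randn(2, 18, h, w)  # dummy confidence maps
    loss = F.mse_loss(model(x), target, reduction=reduction)
    loss.backward()
    return torch.norm(
        torch.stack([p.grad.norm() for p in model.parameters()])
    )

for size in [(64, 64), (256, 256)]:
    print(size, grad_norm_for_size(*size).item())
```

My understanding is that `reduction="mean"` averages the loss over all pixels while `reduction="sum"` adds them up, so the loss reduction seems relevant to how gradient magnitude scales with image size, but I'm not sure that's the whole story.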