Convolutional model yields different outputs on parts of image vs. whole image

Hi there,

I am working on a fully convolutional autoencoder, which I train on 256x256 patches of my 1024x1024 images, using torchvision's RandomCrop transform for on-the-fly data augmentation.

I see very different results when using my model to predict on one 256x256 patch versus the whole 1024x1024 source image. From my understanding, the learned convolutional kernels should recognize the same patterns in both cases, since the scale between the patches and the full image is preserved.

Am I missing something?

Thanks in advance!


I’m not sure I’m understanding this claim correctly, but how would the kernels preserve the scale if you increase the spatial resolution by 4x?

I mean: if I take a 1024x1024 image, split it into 256x256 chunks, and train on those chunks, then running the resulting trained fully convolutional model on the full 1024x1024 image should still work. Each feature map is simply 4 times as wide and as high as during training, but that shouldn’t matter, because at the kernel level (say 5x5) the resolution is unchanged. I should get an output of size 1024x1024 (as I do) instead of the 256x256 I get in training. No?

Thanks for clarifying the use case, as I’ve misunderstood it.
Yes, you should get the same output, except near the edges of the smaller patches, where zero padding sees different neighbors than it does inside the full image.
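A quick sketch of this behavior, assuming a toy fully convolutional model with zero padding (your real model is presumably deeper, but the same reasoning applies): interior pixels of a patch produce identical outputs whether the patch is processed alone or as part of the full image, and only a border roughly the width of the receptive field differs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy fully convolutional net with zero padding (stand-in for the
# actual autoencoder; two 5x5 convs -> ~4-pixel affected border).
model = nn.Sequential(
    nn.Conv2d(3, 8, 5, padding=2),
    nn.ReLU(),
    nn.Conv2d(8, 3, 5, padding=2),
).eval()

full = torch.rand(1, 3, 1024, 1024)
patch = full[:, :, :256, :256]  # top-left 256x256 patch

with torch.no_grad():
    out_full = model(full)[:, :, :256, :256]   # full-image output, cropped to the patch region
    out_patch = model(patch)                   # patch processed on its own

# Interior pixels agree; only the patch borders differ, because zero
# padding replaces the real neighboring pixels that exist in the full image.
interior = (slice(None), slice(None), slice(8, 248), slice(8, 248))
print(torch.allclose(out_full[interior], out_patch[interior], atol=1e-5))  # True
print(torch.allclose(out_full, out_patch, atol=1e-5))  # False (border mismatch)
```

So if the discrepancy you see is much larger than a thin border, something else is going on and the code would help.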

Could you post the code you are using to recreate the larger output?