Hi all,
I have found several, unfortunately contradictory, pieces of information about the correct preprocessing of images for pretrained Torch models from the hub, available via the torchvision.models module.
The models come with a transforms() function that returns some information regarding the appropriate preprocessing. For example, for the Wide ResNet-50-2 weights, it returns:
>>> torchvision.models.resnet.Wide_ResNet50_2_Weights['IMAGENET1K_V1'].transforms()
ImageClassification(
crop_size=[224]
resize_size=[256]
mean=[0.485, 0.456, 0.406]
std=[0.229, 0.224, 0.225]
interpolation=InterpolationMode.BILINEAR
)
What I would like to find out: what is the expected channel layout per model, i.e. RGB or BGR for color images?
Unfortunately, the available information is contradictory. As far as I know, PyTorch uses and expects RGB layout by default, just as other Python libraries such as Pillow or TensorFlow/Keras do. I found a statement in this forum that all pretrained models expect RGB layout (Link); however, that post links to a documentation page that no longer exists, so it is unclear whether the statement still holds.
On the other hand, I found the following statement in the current docs:
Before using the pre-trained models, one must preprocess the image (resize with right resolution/interpolation, apply inference transforms, rescale the values etc). There is no standard way to do this as it depends on how a given model was trained. It can vary across model families, variants or even weight versions. Using the correct preprocessing method is critical and failing to do so may lead to decreased accuracy or incorrect outputs.
Based on this, I can only conclude that the earlier "everything defaults to RGB" claim may be outdated and may no longer hold. So how can I find out which channel layout is the right one?
I found no direct flag in the transforms() return value, and I similarly don't know whether this information is stored somewhere in the model itself. However, I noticed that the given mean/standard deviation values equal the respective values computed on ImageNet. Still, their channel layout is unknown. According to the link above, these mean values are in RGB layout. On the other hand, according to PyImageSearch, they are BGR values because mean * 255 = [123.675, 116.28, 103.53]. As most models share these values, I conjecture that they are indeed RGB and that the PyImageSearch explanation confuses the R and B values. Still, other libraries (e.g. OpenCV or Caffe) use BGR layout, so [0.485, 0.456, 0.406] could indeed be BGR instead of RGB values. Besides, in general the mean and standard deviation values can be arbitrary and are not constrained to equal those of ImageNet, so further information is required on how they are to be interpreted.
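For what it's worth, the arithmetic itself checks out either way; only the channel labels are in question:

```python
# The normalized means from transforms(), in the order torchvision reports them.
mean = [0.485, 0.456, 0.406]

# Scaling back to the 0-255 range reproduces the well-known ImageNet values.
scaled = [round(m * 255, 3) for m in mean]
print(scaled)  # [123.675, 116.28, 103.53]

# So the multiplication alone proves nothing about layout: 123.675 is the
# R-channel mean if the list is RGB, and the B-channel mean if it is BGR.
```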
In summary, I am left with contradictory information and no clarification, and thus would like to ask: how can I find out which channel layout (RGB or BGR) is the right one for which pretrained model?
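One empirical check I can imagine, as a fallback (a sketch under my own assumption that one has a pretrained model and a natural test photo at hand): feed the same preprocessed tensor once as-is and once with the channel axis reversed, and see which version yields the clearly higher top-1 confidence. The channel reversal itself is simple:

```python
import torch

# Stand-in for a preprocessed (3, H, W) image tensor; in practice this
# would come from weights.transforms() applied to a real photo.
x = torch.rand(3, 224, 224)

# Reversing the channel axis converts RGB to BGR (and back).
x_flipped = x.flip(0)  # equivalent to x[[2, 1, 0]]

# With a real pretrained model one would then compare, e.g.:
#   conf_a = model(x.unsqueeze(0)).softmax(dim=1).max()
#   conf_b = model(x_flipped.unsqueeze(0)).softmax(dim=1).max()
# and expect the layout the model was trained on to score higher.
print(torch.equal(x_flipped[0], x[2]))  # True
```

But this feels fragile for a single image, so I would prefer an authoritative answer over this kind of experiment.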