Correct channel layout for PyTorch hub models?

Hi all,

I have found several, unfortunately contradictory, pieces of information about the correct preprocessing of images for pretrained PyTorch models from the hub, available via the torchvision.models module.

The models come with a transforms() method that returns information about the appropriate preprocessing. For example, for a Wide ResNet-50-2 model, it returns (excerpt):

>>> torchvision.models.resnet.Wide_ResNet50_2_Weights['IMAGENET1K_V1'].transforms()
    mean=[0.485, 0.456, 0.406]
    std=[0.229, 0.224, 0.225]

Here, I would like to find out: What is the expected channel layout per model, i.e. RGB or BGR for color images?

Unfortunately, the available information is contradictory. As far as I know, PyTorch uses and expects RGB layout by default, just as other Python libraries such as Pillow or TensorFlow/Keras do. I found a statement in this forum that all pretrained models expect RGB layout (Link); however, that post links to documentation that no longer exists, so it is unclear whether the statement still holds.

On the other hand, I found the following statement in the current docs:

Before using the pre-trained models, one must preprocess the image (resize with right resolution/interpolation, apply inference transforms, rescale the values etc). There is no standard way to do this as it depends on how a given model was trained. It can vary across model families, variants or even weight versions. Using the correct preprocessing method is critical and failing to do so may lead to decreased accuracy or incorrect outputs.

Based on this, I can only conclude that the earlier "everything defaults to RGB" statement is outdated and no longer holds. So how do I find out which channel layout is the right one?

I found no direct flag in the transforms() return value, and I don't know whether this information is available somewhere in the model either. But I noticed that the given mean/standard deviation values are equal to the respective statistics computed on ImageNet. Still, the channel layout is unknown. According to the link above, these mean values are in RGB order. On the other hand, according to PyImageSearch, they are BGR values because mean * 255 = [123.675, 116.28, 103.53]. As most models share these values, I conjecture that they are indeed RGB and that the PyImageSearch explanation confuses the R and B channels. Still, other libraries (e.g. OpenCV or Caffe) use BGR layout, so [0.485, 0.456, 0.406] could in principle be BGR rather than RGB values. Besides, the mean and standard deviation values can in general be arbitrary and are not constrained to equal those of ImageNet, so further information on how they are interpreted is required.
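The arithmetic behind that claim is easy to verify: scaling the normalized means back to the [0, 255] range gives exactly the values PyImageSearch quotes, whatever channel order they are in:

```python
# Scale the ImageNet channel means from the [0, 1] range back to [0, 255].
means = [0.485, 0.456, 0.406]
scaled = [round(m * 255, 3) for m in means]
print(scaled)  # [123.675, 116.28, 103.53]
```

So the scaling alone cannot tell RGB from BGR; it only shows both sources quote the same statistics.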

In short, I am left with contradictory information and no clarification, so I would like to ask: how can I find out which channel layout (RGB or BGR) is the right one for a given pretrained model?

I’m unsure where exactly in the link the BGR reference is mentioned, but the dog image mentions RGB. Also, the posted values are the stats taken from the RGB ImageNet data, as seen in my linked answer and the older docs. Since torchvision still heavily depends on PIL to load and process images, I would assume all stats are in RGB, but @pmeier can correct me if that’s not the case anymore.

Thanks for your comment! So at least most models expect RGB inputs. Still, the current docs (Link) don't state that this holds in general, and they even tell me that there is no standard way to preprocess the images. If this statement does not also cover the channel layout, and the layout is always RGB, this should be stated explicitly somewhere, I think. Conversely, if there are models that expect BGR instead of RGB, there should be a standardized way to unambiguously identify the right channel layout. What do you think?

torchvision has at no point in time supported BGR images or, more generally, anything but RGB and (partially) grayscale images. Meaning: RGB is the right one. All of the transforms and pretrained weights for the models assume that.

The part

There is no standard way to do this as it depends on how a given model was trained

refers to which transforms were used during training. And it is indeed correct that there is no standard way; in general, each model uses something different.

PIL is still part of the API, but we recently made some advances such that our pure tensor API (minus the actual image loading from disk) is on par with, and in many cases even faster than, the PIL path.

Documentation can always be improved. Happy to review a PR if you'd like to send one. You can ping me there with the same user name, i.e. @pmeier.

Excellent, thanks for this clarification!