Input format for pretrained torchvision models

Many torchvision models are also available with pretrained weights.
Since it is not obvious, I’d like to know what input format the models were trained on.
I’m assuming it’s color images, so 3-channel tensors, but is the channel order RGB or BGR (as in OpenCV / cv2)?
Also, what is the value range? Is it uint8 ([0:255]) or a float type?
If it’s float, is it normalized to [0:1] or to [-1:1]?
Please specify these details so that the results are easy to reproduce.


From the docs:

All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. The images have to be loaded in to a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].
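
For anyone loading images with OpenCV, here is a minimal sketch of the preprocessing described in the quote, written with NumPy only. The function name `preprocess_bgr_uint8` is made up for illustration, and the sketch assumes the input is an H x W x 3 uint8 array in BGR order (what `cv2.imread` returns); resizing or cropping to at least 224 x 224 is left out.

```python
import numpy as np

# ImageNet statistics quoted from the docs above (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess_bgr_uint8(image_bgr):
    """Convert an H x W x 3 uint8 BGR array into a normalized 3 x H x W float32 array."""
    if image_bgr.ndim != 3 or image_bgr.shape[2] != 3:
        raise ValueError("expected an H x W x 3 image")
    rgb = image_bgr[:, :, ::-1]                            # BGR -> RGB
    scaled = rgb.astype(np.float32) / 255.0                # uint8 [0, 255] -> float [0, 1]
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD   # broadcasts over the channel axis
    return np.ascontiguousarray(normalized.transpose(2, 0, 1))  # HWC -> CHW
```

The result can be wrapped with `torch.from_numpy(...)` and given a leading batch dimension with `unsqueeze(0)` before being passed to a model.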

Thank you for the detailed answer!

Hi @ptrblck, I was wondering, isn’t it preferred to normalize the data to the [-1:1] range instead of the [0:1] range?

I’m not sure whether [-1:1] is preferred over [0:1], but the implemented normalization standardizes the data, such that it has zero mean and unit variance (sometimes known as the z-score).
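
To make the difference concrete, here is a small sketch that computes the smallest and largest values a [0:1] input can take after this standardization. It only uses the mean and std values quoted from the docs; the variable names are illustrative. Note that the zero mean and unit variance hold with respect to the dataset statistics, not necessarily for each individual image.

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

# Per-channel bounds of (x - mean) / std for x in [0, 1].
lowest = (0.0 - mean) / std
highest = (1.0 - mean) / std

print("lowest per channel: ", lowest)
print("highest per channel:", highest)
```

So the standardized values fall in neither [0:1] nor [-1:1]; they span roughly -2.1 to 2.6 depending on the channel.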

@ptrblck What should one do when exporting the model to ONNX? Can one just assume a batch size for the model and specify the actual batch size later? Or is it possible to leave the batch size unspecified when exporting to ONNX?

You could use the dynamic_axes argument to specify that the batch size is not fixed, as done in this tutorial.
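
A minimal sketch of what that could look like, assuming a model that takes a single image batch and returns a single tensor. The helper name `export_with_dynamic_batch` and the labels "input", "output", and "batch_size" are illustrative choices, not taken from the tutorial, and exporter behavior can differ between PyTorch versions.

```python
import torch

# Mark axis 0 of the input and the output as dynamic. The names are arbitrary labels.
DYNAMIC_AXES = {"input": {0: "batch_size"}, "output": {0: "batch_size"}}


def export_with_dynamic_batch(model, target, height=224, width=224):
    """Export `model` to ONNX so that the batch dimension is not fixed."""
    model.eval()
    # The batch size of the dummy input is only used for tracing.
    dummy = torch.randn(1, 3, height, width)
    torch.onnx.export(
        model,
        (dummy,),
        target,
        input_names=["input"],
        output_names=["output"],
        dynamic_axes=DYNAMIC_AXES,
    )
```

Calling `export_with_dynamic_batch(model, "model.onnx")` should then produce a file that accepts inputs of shape (N, 3, 224, 224) for varying N.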

@ptrblck thanks a lot. That helped!


In the docs, mean and STD values are specified for classification, segmentation, and video classification.
They are not specified for object detection models:

Can we get their values as well?

You could manually pass image_mean and image_std when creating the model, if you’ve calculated them on your dataset.
If you don’t pass them, these values will be used, which apparently are the ImageNet stats.

CC @pmeier for more information.


For detection models we do not use any normalization. I don’t know why not and can’t comment if this would be beneficial or not. You’ll have to ask @fmassa about that :wink:

Thank you. Indeed I tried inference on a pretrained model with no normalization and it worked. Some notes though:

  1. Consider explaining why the pretrained detection models are the only ones that don’t require image normalization (I understand that the training set was not normalized. But again, why?)
  2. Worth mentioning that no normalization is needed. The classification, segmentation and detection pretrained models are trained on ImageNet, so one may think all of them require ImageNet normalization, when in fact only the classification and segmentation models require normalization.
    Perhaps it’s best to put this info in a table, since the pretrained video models also use normalization, but with different values.

Those are valid points. Do you want to open an issue for that or should I do it in your name?

I would appreciate it if you could do it. I’m mattans on GitHub. Thanks!

For reference:

For the sake of completeness:
Based on Francisco’s answer, the normalization is used internally in these lines of code, which is using the GeneralizedRCNNTransform from my previous link.