Hi,
Many torchvision models are also available in a fully trained mode.
As it is not obvious, I’d like to know what input format the models were trained on.
I’m assuming it’s color images, so 3-channel tensors, but is it RGB or BGR (like in OpenCV / cv2)?
Also, what is the value range? Is it uint8 ([0, 255]) or a float type?
If it’s float, is it normalized to [0, 1] or to [-1, 1]?
Please specify these details so it would be easy to reproduce the results.
All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. The images have to be loaded into a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].
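A minimal sketch of that preprocessing in NumPy (the channel-last uint8 input layout is my assumption; torchvision’s own transforms do the same job):

```python
import numpy as np

# ImageNet statistics quoted above (per RGB channel)
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img_uint8):
    """Scale a uint8 HxWx3 RGB image to [0, 1], apply the per-channel
    normalization, and return a CHW float32 array (the layout the
    models expect)."""
    img = img_uint8.astype(np.float32) / 255.0  # [0, 255] -> [0, 1]
    img = (img - MEAN) / STD                    # per-channel normalization
    return img.transpose(2, 0, 1)               # HWC -> CHW

# A dummy 224x224 mid-gray RGB "image"
img = np.full((224, 224, 3), 128, dtype=np.uint8)
x = preprocess(img)
print(x.shape)  # (3, 224, 224)
```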
Not sure if [-1, 1] is preferred over [0, 1], but the implemented normalization standardizes the data so that it has zero mean and unit variance (sometimes known as a z-score).
@ptrblck What does one do if you have to export the model to ONNX? Can one just assume a batch size for the model and specify the batch size later? Or is it possible not to fix the batch size when exporting to ONNX?
You could manually pass image_mean and image_std during model creation, if you’ve calculated them on your own dataset.
If you don’t pass them, these values will be used, which apparently are the ImageNet stats.
For the detection models we do not use any normalization. I don’t know why not and can’t comment on whether it would be beneficial. You’ll have to ask @fmassa about that.
Thank you. Indeed I tried inference on a pretrained detection model with no normalization and it worked. Some notes though:

- Consider explaining why the pretrained detection models are the only ones that don’t require image normalization (I understand that the training set was not normalized, but again, why?).
- It’s worth mentioning explicitly that no normalization is needed. The classification, segmentation and detection pretrained models are all trained on ImageNet, so one may think all of them require ImageNet normalization, when in fact only the classification and segmentation models do.
- Perhaps it’s best to put this info in a table, since the pretrained video models also require normalization, but with different statistics.
For the sake of completeness:
Based on Francisco’s answer, the normalization is applied internally in these lines of code, which use the GeneralizedRCNNTransform from my previous link.
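In plain PyTorch, the normalization those lines perform boils down to broadcasting per-channel stats over a CHW image (a paraphrase of the idea, not the actual implementation):

```python
import torch

def normalize(image, mean, std):
    # Broadcast the per-channel mean/std over the H and W dimensions,
    # as GeneralizedRCNNTransform does for each image in the batch.
    mean = torch.as_tensor(mean, dtype=image.dtype)[:, None, None]
    std = torch.as_tensor(std, dtype=image.dtype)[:, None, None]
    return (image - mean) / std

img = torch.rand(3, 4, 4)
out = normalize(img, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
print(out.shape)  # torch.Size([3, 4, 4])
```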