The transforms for ConvNeXt all reuse ImageClassification, as seen here, which accepts a PIL.Image, scales it to [0, 1] first, and then normalizes it as described here. So I assume the input can be a plain uint8 PIL.Image in [0, 255], assuming you are using the predefined transformation.
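As a minimal sketch of what that pipeline does (assuming the standard ImageNet mean/std constants, which is what torchvision's ImageClassification uses by default; the random tensor here just stands in for a uint8 image):

```python
import torch

# ImageNet normalization constants used by torchvision's ImageClassification
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

# stand-in for a uint8 RGB image tensor (C, H, W) with values in [0, 255]
img_u8 = torch.randint(0, 256, (3, 224, 224), dtype=torch.uint8)

x = img_u8.float() / 255.0   # scale to [0, 1], like transforms.ToTensor()
x = (x - mean) / std         # normalize, like transforms.Normalize(mean, std)
```

With these constants the normalized values end up roughly in [-2.2, 2.7] rather than [-1, 1], which is why the input range matters for a pretrained encoder.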
My use case is applying the ConvNeXt encoder for segmentation. I have been feeding inputs in the [-1, 1] range and it works quite well, but now that you pointed me to the normalization constants, I'll try the usual scale-to-[0, 1]-then-normalize approach and let you know if the results improve.
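For reference, remapping an existing [-1, 1] input to the pretrained normalization is a cheap change; a sketch, assuming the standard ImageNet constants and a hypothetical tensor already scaled to [-1, 1]:

```python
import torch

mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

x_pm1 = torch.rand(3, 224, 224) * 2 - 1  # hypothetical input in [-1, 1]
x01 = (x_pm1 + 1) / 2                    # map back to [0, 1]
x = (x01 - mean) / std                   # apply the pretrained normalization
```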
It will take quite a while before I can comment on the final metrics, but the training and validation losses are now decreasing faster than before. So it seems that identifying the right normalization range has been useful. Thanks again.