FasterRCNN models with single channel input (surprisingly?) work

I’ve been using the pre-trained FasterRCNNResnet50FPN model for object detection and noticed a behavior that was unexpected (at least to me). If you run inference with a single-channel image (e.g. grayscale), the normalization step subtracts a 3-channel (RGB) mean, and the result is a 3-channel tensor, so inference works (because the ResNet50 backbone expects a 3-channel input). I understand that the way PyTorch broadcasts tensors is what causes the shape to be altered like this. For reference, it is this line that causes the reshaping.
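
To make the broadcasting explicit, here is a minimal sketch of the effect (not the literal torchvision code, just the equivalent arithmetic):

```python
import torch

# Subtracting a 3-element mean from a 1-channel image broadcasts
# the channel dimension from 1 to 3.
image = torch.rand(1, 224, 224)             # grayscale input, shape [1, H, W]
mean = torch.tensor([0.485, 0.456, 0.406])  # ImageNet RGB mean
std = torch.tensor([0.229, 0.224, 0.225])   # ImageNet RGB std

out = (image - mean[:, None, None]) / std[:, None, None]
print(out.shape)  # torch.Size([3, 224, 224])
```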

My question is: is this reshaping expected behavior? That is, if I pass in a grayscale image, should I expect the model to implicitly make it RGB (by broadcasting the single channel and subtracting the per-channel RGB mean)? Or should I expect the model to just “crash” and complain that the input shape is not correct? Or, at the very least, emit a warning that the input shape was implicitly changed?
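
For reference, this is roughly how I ran into the behavior (a sketch with a random single-channel tensor, assuming the pretrained fasterrcnn_resnet50_fpn, not my exact code):

```python
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

gray = torch.rand(1, 300, 400)  # single-channel (grayscale) image, shape [1, H, W]
with torch.no_grad():
    out = model([gray])         # runs without an error or warning
print(out[0].keys())            # dict_keys(['boxes', 'labels', 'scores'])
```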

Personally, I’m a big fan of loud and proper error messages, and such an error would also be raised if you were using the torchvision.transforms.Normalize transformation, since it uses an inplace op, as seen here.
I’m unsure why the detection.transform implementation doesn’t use torchvision.transforms.functional.normalize to get the same behavior and avoid code duplication, but it might have been written this way exactly to add this type of “flexibility”, so that users are able to pass grayscale images.
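
To illustrate the difference, a small sketch (assuming the same kind of grayscale input as above): the in-place sub_()/div_() used by Normalize cannot write the broadcast [3, H, W] result back into a [1, H, W] tensor, so it fails loudly:

```python
import torch
from torchvision import transforms

image = torch.rand(1, 224, 224)  # grayscale input, shape [1, H, W]
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

try:
    normalize(image)
except RuntimeError as err:
    # something like: output with shape [1, 224, 224] doesn't match
    # the broadcast shape [3, 224, 224]
    print(err)
```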

CC @fmassa: is my speculation correct, or should these transformations rather be “unified”?