Non-optimal grayscale conversion algorithm in torchvision?

It seems that torchvision Grayscale uses the following formula to convert RGB images to grayscale:
L = R * 0.2989 + G * 0.5870 + B * 0.1140
Some other packages, like OpenCV, use very similar color conversion values.

These ratios for merging RGB into one channel come from the BT.601 standard. As far as I understand, this standard was created for television and is based on two major ideas:

  1. The human eye perceives some colors better than others. For example, we can distinguish more shades in the green part of the spectrum than in the red one.
  2. Different TV systems represent color differently. The specified RGB ratios are used in the PAL/NTSC standards, but there are TV standards with other ratios.

Torchvision Grayscale is typically used in deep learning, where we take information from camera sensors (i.e. images) and process it to get non-visual results. And (surprise!) none of this is related either to the human eye or to ancient TV standards.

Every time we convert digital camera images from RGB to grayscale, we lose some information. And the more distortion we introduce into the conversion algorithm, the more of the original information we lose.
It's pretty obvious that applying arbitrary coefficients while merging color channels leads to irreversible loss of image data. And standards classified as 'perceptual luminance-preserving conversion' are fairly arbitrary for most machine learning tasks.
So why should these RGB ratios still be a thing in Torch? Why not just weight all the channels equally?

I think it makes sense for torchvision to apply the commonly used transformation methods even if their reference implementation is dated. You are, however, absolutely correct that this transformation might not be ideal, and you could easily create your own ToGrayscale transform using your own coefficients.
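A minimal sketch of such a custom transform, averaging the three channels equally instead of using the BT.601 weights. The class name `ToGrayscaleEqual` is hypothetical, not part of torchvision:

```python
import torch

class ToGrayscaleEqual:
    """Hypothetical transform: equal-weight RGB-to-grayscale conversion."""

    def __call__(self, img: torch.Tensor) -> torch.Tensor:
        # img: float tensor of shape (..., 3, H, W);
        # average over the channel dimension with equal weights.
        return img.mean(dim=-3, keepdim=True)

x = torch.rand(3, 4, 4)        # a random RGB "image"
gray = ToGrayscaleEqual()(x)
print(gray.shape)              # torch.Size([1, 4, 4])
```

Because it follows the usual callable-transform convention, it can be dropped into a `transforms.Compose` pipeline like any built-in transform.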

Hi Cepera!

In addition to what @ptrblck said, I would note that if you are concerned
about the modest difference in loss of information between using a
“perceptual” conversion of RGB to grayscale vs. simply averaging the
three channels together, you should be much more concerned about
the dramatically larger loss of information that occurs when you combine
the three RGB channels into a single grayscale channel in the first place.

If you have RGB data to work with and such information loss is material,
you should use a model that accepts three (RGB) channels for its input.
If your model only accepts a single (grayscale) channel, it would be quite
straightforward to modify the first layer of the model to accept three
channels instead of only one.
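A rough sketch of that first-layer modification, assuming a hypothetical model whose first layer is a 1-channel `Conv2d` (the layer sizes here are illustrative, not from any particular model). One simple initialization is to copy the grayscale weights into each of the three channels, scaled by 1/3, so the new layer initially behaves like the old one fed the channel average:

```python
import torch
import torch.nn as nn

# Old first layer: accepts a single (grayscale) channel.
first = nn.Conv2d(1, 16, kernel_size=3, padding=1)

# New first layer: accepts three (RGB) channels.
new_first = nn.Conv2d(3, 16, kernel_size=3, padding=1)
with torch.no_grad():
    # Replicate the 1-channel weights across RGB, scaled by 1/3,
    # so the initial output equals the old layer on the channel mean.
    new_first.weight.copy_(first.weight.repeat(1, 3, 1, 1) / 3.0)
    new_first.bias.copy_(first.bias)

x = torch.rand(1, 3, 8, 8)
y_rgb = new_first(x)
y_gray = first(x.mean(dim=1, keepdim=True))
print(torch.allclose(y_rgb, y_gray, atol=1e-5))  # True
```

From this starting point, training is free to move the per-channel weights away from the equal average toward whatever RGB mixing helps the task.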

You can look at this approach, roughly speaking, as letting your model
learn (via the three-channel weights of the first layer) the specific RGB-
to-grayscale conversion that is optimal for your particular problem (although
the RGB conversion being learnt is potentially more general than just a
simple weighted average).

On a more philosophical note, perhaps we shouldn’t be using RGB
cameras to capture our data – perhaps instead we should use infrared /
visible / ultraviolet cameras, or six-color cameras like some butterflies’
eyes, or four-color cameras like the color vision some people have.


K. Frank


KFrank, my reason for using grayscale in my current project is that the images come from CCTV cameras. Half of the time these cameras work in night mode, which is grayscale-only (IR). I need to detect events like people and opened doors in the images, and most of these events happen in night mode, i.e. in grayscale. I get no accuracy improvement from using 3-color images, only performance and memory costs.

So yes, I get some loss of information when converting a 3-channel image to grayscale, and I get some 'loss' from the fact that infrared grayscale images do not fully correspond to daylight grayscale ones. And I also get the loss from the wrong implementation of the Grayscale function in torch.

The problem here is that most people in my situation understand the first two sources of loss and try to deal with them somehow. BUT the third source of loss, from the Grayscale function in torchvision, isn't obvious at all. It is not mentioned in the documentation, and there are no discussions about it on the internet, so the programmer cannot notice and correct it.

That is why this situation looks like a bad software practice.

I would claim the opposite, i.e. it’s a good software practice to stick to standards.
I think more users would run into unexpected results when comparing the grayscale output of an RGB image using the torchvision.transforms vs. e.g. the OpenCV or PIL one and would expect to see the same (up to floating point precision) results.
So far your idea sounds interesting, but I haven't seen any verification, from you or in the literature, of your claim that using the same scaling for each channel might be advantageous. Could you post literature references for experiments comparing this transform?

I said that it is bad practice not to state in the documentation which of the many known standards is used. Especially when the original data is distorted by an algorithm that has no relation to the currently used technology.
When there is no detailed documentation for a function like grayscale, I would assume the most naive approach: all channels are treated equally. That method is called the 'average grayscale method', and it is often referred to as the default method for grayscale conversion.

It’s interesting that, unlike the OpenCV and PIL frameworks, Torch doesn’t even use the exact formula defined by the BT.601 standard. It uses something merely close to the standardized coefficients.
0.299 R + 0.587 G + 0.114 B (original formula from BT.601, used by OpenCV and PIL)
0.2989 R + 0.587 G + 0.114 B (used in torchvision)
Here is the link to the ITU - BT.601 standard so you can check: BT.601 : Studio encoding parameters of digital television for standard 4:3 and wide screen 16:9 aspect ratios
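To put the discrepancy in perspective, a quick back-of-the-envelope check of how far the two red coefficients can drive apart a pixel value on an 8-bit image:

```python
# The red coefficients differ by 0.0001 (0.299 vs. 0.2989); the green
# and blue coefficients match. So for 8-bit input (R <= 255) the two
# formulas can differ by at most 0.0001 * 255 per pixel.
max_diff = abs(0.299 - 0.2989) * 255
print(max_diff)  # about 0.0255, well under one 8-bit gray level
```

So the two formulas produce identical uint8 output after rounding; the objection here is about which standard is documented, not about a visible numeric difference.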

Let me assume that the lack of documentation comes down to this: when the Grayscale function was coded, nobody could explain why exactly the BT.601 grayscale conversion standard was chosen, and nobody had read the original standard specification for the correct numbers; the formula was just copied from some other package.

For me it’s the opposite: using the NTSC standard is what needs to be justified in machine learning frameworks. My first post was about why using NTSC grayscaling here has no plausible justification: when copying functions from visual image-processing frameworks, we should understand that they often change the original data for the sake of visual representation. That is exactly the case with the BT.601 color standard, as with the other standards classified as perceptual and human-eye-based.
Wikipedia: Grayscale#Colorimetric_(perceptual_luminance-preserving)_conversion_to_grayscale

I'm pretty sure there could be practical explanations of why BT.601 isn’t good in ML. Here is a paper on face detection that shows an approximately equal color distribution in a general dataset.
On Conversion from Color to Gray-scale Images for Face Detection

“In comparison with this, the nonskin-color signal is almost distributed with equal strength in the R, G, B channels. Thus, we have reason to believe that for face detection, it is inappropriate to use the weighting parameters of Eq.3 [BT.601], where the signal in the most important channel (red) is significantly suppressed, while those in the other two channels of minor importance are enhanced too much.”

But for me, the general argument for treating all the color channels equally by default is pretty analytical: distorting the color ratios of the original image increases the deviation from that image as measured by standard estimators like mean squared error (MSE).
However, emphasizing some color channels in grayscale could be productive if the dataset represents those channels more than others, which doesn’t seem to be the case for widely used datasets.
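A small numerical sketch of that MSE argument, under the simplifying assumption that the three channels have identical statistics (here, i.i.d. uniform random values rather than a real dataset). We replicate the grayscale value back into three channels and measure the MSE against the original; under this assumption, equal weights come out ahead of the BT.601 weights:

```python
import torch

torch.manual_seed(0)
x = torch.rand(1000, 3)  # 1000 random "pixels" with i.i.d. RGB channels

def recon_mse(weights):
    # Weighted grayscale, then MSE of the replicated gray vs. original RGB.
    w = torch.tensor(weights)
    gray = (x * w).sum(dim=1, keepdim=True)
    return ((x - gray) ** 2).mean().item()

mse_equal = recon_mse([1 / 3, 1 / 3, 1 / 3])
mse_bt601 = recon_mse([0.2989, 0.5870, 0.1140])
print(mse_equal < mse_bt601)  # True for i.i.d. channels
```

Of course, real image channels are correlated and not identically distributed, which is exactly where the "it depends on your data" caveat discussed below comes in.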

Btw, there is an interesting paper whose authors improved the performance of the object-detection algorithm YOLOv3 by using fine-tuned one-channel grayscale satellite images instead of color ones:

Grayscale Based Algorithm for Remote Sensing with Deep Learning
The results show that the grayscale-based method can sense the target more accurately and effectively than the traditional YOLOv3 approach.

Hi Cepera!

Just to be clear about the claims in the paper you reference:

They take two separate RGB color-image datasets – call them “dataset
A” and “dataset B” – that are distinctly different in character. For their
baseline, they combine dataset A and dataset B together into a single
RGB dataset without modification or preprocessing, and train a network
that takes three-channel RGB images as input.

For the “Grayscale Based Algorithm” they convert dataset A from RGB
to single-channel grayscale using an ad hoc (not “learned”) custom
grayscale weighting, call it grayscale-weighting A. They separately
convert dataset B into grayscale using a second ad hoc weighting,
grayscale-weighting B. These two grayscale weightings are chosen,
in essence by hand, to “normalize” away much of the difference in
character between dataset A and dataset B. After this preprocessing
step, they combine the two datasets into a single grayscale dataset.

They then use this combined, “normalized” dataset to train a network
that takes single-channel, grayscale images as input (but that is otherwise
identical to their baseline network).

They claim better results with the grayscale dataset and single-channel network.

I’ll leave it to others to judge whether this is a fair test of whether grayscale
training offers benefits over RGB training.

Comment 1: They use mean-average-precision as a performance metric
and claim better (higher) mean-average precision for their grayscale
training. However they report a worse (higher) loss function for their
grayscale training (but don’t comment on the apparent discrepancy).

Comment 2: Neither of the custom grayscale weightings uses all the
colors equally, as you seem to advocate in your original post.


K. Frank

This paper is a bit old (2011), but it compares the most-used grayscale algorithms in terms of image recognition. Sorry for the long quote, but it is exactly about our topic.
They use the term ‘Intensity’ for the equal-channel grayscaling method.
Luminance and Luma are two closely related standards, BT.601 (used in torch) and BT.709.

Color-to-Grayscale: Does the Method Matter in Image Recognition?
Our objective was to determine if the method used to convert from color-to-grayscale matters, and we can definitively say that it does influence performance. For all datasets there was a significant gap between the top performing and worst performing methods. Our results indicate that the method used to convert to grayscale should be clearly described in all publications, which is not always the case in image recognition.

For object and face recognition, Gleam is almost always the top performer. For texture recognition, Luminance and Luma are good choices. Although color descriptors are sometimes extracted in the HSV colorspace, our results suggest replacing Value with Gleam is advisable.

In general, we observed little benefit from using a method based on human brightness perception. The only potential exception was textures. Emulating the way humans perceive certain colors as brighter than others appears to be of limited benefit for grayscale image recognition. However, methods that incorporate a form of gamma correction (e.g., Lightness, Gleam, Luma, Luster, etc.) usually perform better than purely linear methods such as Intensity and Luminance.

Don’t forget that they compare the loss for two different datasets (grayscaled and 3-color) that have different data structures.
Setting aside the comparison methodology, they did what they did: they achieved high accuracy with grayscaling, using some grayscale-specific fine-tuning.

Not sure if you are interested, but there is a better-quality paper that tries to explain why it is possible to get higher accuracy on grayscale:
Bui, Hieu Minh, et al. “Using grayscale images for object recognition with convolutional-recursive neural network.”

And of course, there are papers that show that for models like ResNet color images are better:
Funt, Brian, and Ligeng Zhu. “Does colour really matter? Evaluation via object classification.”

That’s a good point and I’m sure a contribution to fix the docs is more than welcome. :wink:

Thanks for the references as they are interesting to read!
However, quotes like

are not particularly convincing to break the compatibility of torchvision.transforms with other image CV libs especially since you can use your custom transformation with a single line of code.

Ok. So the grayscale formula used in torchvision having the worst accuracy (ranking 12th or 13th of 13) on most of the datasets (3 of 4) is what you consider ‘not convincing’. What would you consider convincing? Do you have any falsifiable thesis?

Right, as I don’t know how transferable “SIFT [14], SURF [15], Geometric Blur [16], and Local Binary Patterns (LBP) [17]” are to ML/DL tasks.
I’m not trying to disprove the authors’ findings, but if the answer to the question of the “best” reduction weights is “well, it depends on your data, algorithm, task, …”, then I still think sticking to the standard approach used in other libs is the right decision to avoid potentially breaking other models.

Again, if your transformation needs other weights, use transforms.Lambda and apply your custom transformation (or alternatively write a custom transform class).