Image input range to convolutional network

When the input to a convolutional neural network is a (grayscale) image, it is common practice to scale the pixel values to the range [0, 1], for example by using the torchvision.transforms.ToTensor transform.
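A minimal sketch of that first step (assuming PIL, NumPy, and torchvision are installed; the random 28x28 image is just a stand-in for real data):

```python
# ToTensor converts a uint8 PIL image in [0, 255] to a float tensor in [0.0, 1.0].
import numpy as np
from PIL import Image
from torchvision import transforms

img = Image.fromarray(np.random.randint(0, 256, (28, 28), dtype=np.uint8), mode="L")
x = transforms.ToTensor()(img)          # shape (1, 28, 28), dtype float32
print(x.min().item(), x.max().item())   # values lie in [0, 1]
```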

Many works then proceed to subtract 0.5 and scale by 2 in order to transform this range to [-1, 1]. This is of course a normalization step. However, both in the literature and in applications, there are many cases where this last transform is not performed.
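In torchvision pipelines this is commonly written as a Normalize transform with mean 0.5 and std 0.5, which is the same "subtract 0.5, scale by 2" mapping (for RGB images one would pass three-element mean/std lists instead):

```python
from torchvision import transforms

to_minus_one_one = transforms.Compose([
    transforms.ToTensor(),                        # uint8 [0, 255] -> float [0, 1]
    transforms.Normalize(mean=[0.5], std=[0.5]),  # (x - 0.5) / 0.5 -> [-1, 1]
])
```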

What would be the argument for (not) applying this normalization step? Why do some projects incorporate this step in their design and others do not?

Normalizing/whitening/decorrelating the input is beneficial: for very simple models it can be shown that the loss surface then also becomes decorrelated/rounder, so the model can be trained more easily.
I would generally recommend taking a look at Bishop's Pattern Recognition and Machine Learning as well as the Deep Learning Book.
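As a rough numerical illustration of the "rounder loss surface" point (my own sketch, not taken from those books): for a linear model with MSE loss, the Hessian is proportional to X^T X, so the conditioning of the loss surface is exactly the conditioning of the inputs, and standardizing features that live on very different scales improves it dramatically:

```python
import torch

torch.manual_seed(0)

# Two "pixel-like" features on very different scales, e.g. [0, 255] vs [0, 1].
n = 1000
x1 = torch.rand(n) * 255.0
x2 = torch.rand(n)
X_raw = torch.stack([x1, x2], dim=1)

def condition_number(X):
    # The Hessian of the MSE loss 0.5 * ||Xw - y||^2 / n is X^T X / n;
    # its condition number measures how elongated the loss contours are.
    H = X.T @ X / X.shape[0]
    eigvals = torch.linalg.eigvalsh(H)
    return (eigvals.max() / eigvals.min()).item()

# Standardize each feature to zero mean and unit variance
# (not full whitening, but enough here since the features are independent).
X_norm = (X_raw - X_raw.mean(dim=0)) / X_raw.std(dim=0)

print(f"condition number, raw inputs:        {condition_number(X_raw):.1f}")
print(f"condition number, normalized inputs: {condition_number(X_norm):.1f}")
# The normalized Hessian is far better conditioned, so plain gradient descent
# tolerates a larger step size and converges faster.
```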

What about the specific case of images? Are projects that do not normalize missing out on a potential improvement? Is there any reason not to normalize?