ToTensor scaling to [0,1] resulting in smaller gradients

Hi everyone,
I’m trying to understand the reasoning behind scaling 8-bit PIL images from the range [0, 255] to [0, 1]. I am aware of whitening techniques and assume this practice is similar or aims at the same goal, but in practice the scaling feels counterproductive.

I don’t have any rigorous understanding or backing for this, but when I’m training models, reconstruction losses like MSE always seem to work better with [0, 255] images. Intuitively, I assumed that a numerically higher loss value produces larger gradients, which in turn update the parameters faster and achieve better reconstruction. I would appreciate any clarification.

In addition to this, when I’m working with [0, 255] images I tend to use the default 1e-3 as my learning rate. Based on this, should I be using an even smaller learning rate when working with [0, 1] images? Something like 1e-5 or 1e-4? Any advice would be very much appreciated.
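A quick sanity check of this intuition (a toy sketch, not from any real model — the target/prediction values are made up) shows how much the pixel range alone changes the loss and gradient magnitudes:

```python
import torch

# Hypothetical example: the same reconstruction error, once with
# pixels in [0, 255] and once scaled to [0, 1].
target_255 = torch.full((3, 8, 8), 200.0)
pred_255 = torch.full((3, 8, 8), 180.0, requires_grad=True)
pred_01 = (pred_255.detach() / 255.0).requires_grad_(True)

loss_255 = torch.nn.functional.mse_loss(pred_255, target_255)
loss_01 = torch.nn.functional.mse_loss(pred_01, target_255 / 255.0)
loss_255.backward()
loss_01.backward()

# MSE scales with the square of the pixel range, the gradient linearly.
loss_ratio = loss_255.item() / loss_01.item()
grad_ratio = (pred_255.grad.abs().mean() / pred_01.grad.abs().mean()).item()
print(loss_ratio)   # ≈ 255**2
print(grad_ratio)   # ≈ 255
```

So the same relative error yields gradients 255 times larger in the unscaled setting, which is exactly what a smaller learning rate would compensate for.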

The underlying reason is derivatives; here is a simple picture of what happens.

The derivative of f(x) = x^2 is f'(x) = 2x. If we start at x = 5 then f(5) = 25.

We want to move to the minimum of f(x). So here MSE is like f(x).

SGD moves x toward the minimum of f(x) using the update new_x = old_x - 2 * old_x * lr

lr controls the size of the update to x; in fact, here x plays the role of a weight.

With lr = 1 we have new_x = 5 - 10 = -5, and indeed we will keep oscillating between 5 and -5 forever.

With lr = 0.1? Then new_x = 5 - 1 = 4, and this new x is closer to the minimum because f(4) = 16.
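The two cases above can be run as a tiny loop (a sketch of plain gradient descent on f(x) = x^2, no libraries needed):

```python
def sgd_on_square(x, lr, steps=20):
    """Minimise f(x) = x**2 by repeating new_x = old_x - 2 * old_x * lr."""
    for _ in range(steps):
        x = x - 2 * x * lr
    return x

x_high = sgd_on_square(5.0, lr=1.0)   # flips sign each step, never converges
x_good = sgd_on_square(5.0, lr=0.1)   # shrinks by 0.8x per step toward 0
print(x_high)   # still at magnitude 5
print(x_good)   # close to the minimum at 0
```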

We can extract some answers:

  • If the learning rate is too high, the model won’t converge (it oscillates).
  • If the learning rate is too small, training may take too long.

In the previous formula, the pixels play a role more like the 2 in new_x = old_x - 2 * x * lr

With large values in place of the 2, the gradient may become very large, for example:

new_x = 5 - 255 * 5 * lr

We could compensate with a very small learning rate here, like 10^-8, but apparently this isn’t best in practice.
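The same effect shows up with autograd (a made-up one-weight example: prediction w * x, squared-error loss, so the gradient with respect to w is 2 * (w * x - y) * x and the input scale multiplies directly into it):

```python
import torch

# Same relative error, once with inputs in [0, 1] and once in [0, 255].
grads = {}
for scale in (1.0, 255.0):
    x = torch.tensor(0.8 * scale)   # the "pixel" value
    y = torch.tensor(0.5 * scale)   # target, scaled the same way
    w = torch.tensor(1.0, requires_grad=True)
    loss = (w * x - y) ** 2
    loss.backward()
    grads[scale] = w.grad.item()

ratio = grads[255.0] / grads[1.0]
print(ratio)   # ≈ 255**2, since both the error and x grow by 255x
```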

There is no good rule for the lr. I normally start with 0.01 and automatically reduce it 10x when learning stagnates.

Thank you so much for your answer! This is definitely informative. I just wanted to add, when you say you “automatically reduce the learning rate 10x when learning stagnates”, do you imply that you do this while training with code? Or do you train up to a checkpoint, analyze loss and gradient graphs and train with a lower learning rate from that checkpoint onward?

You are welcome.

No, there is an automatic LR scheduler in torch that does this; search for “pytorch learning rate scheduler”. To do it, it computes the loss on the test set and then changes the learning rate (not fully sure, but I think it works like that).
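For reference, the scheduler described above sounds like `torch.optim.lr_scheduler.ReduceLROnPlateau`. A minimal sketch with made-up validation losses (the loss values and `patience` here are just for illustration):

```python
import torch

model = torch.nn.Linear(1, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
# Cut the LR by 10x once the monitored loss stops improving for 2 epochs.
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
    opt, mode="min", factor=0.1, patience=2
)

lrs = []
# Simulated validation losses that stagnate after the third epoch.
for val_loss in [1.0, 0.5, 0.4, 0.4, 0.4, 0.4]:
    sched.step(val_loss)  # the scheduler reacts to the monitored loss
    lrs.append(opt.param_groups[0]["lr"])
print(lrs)   # starts at 0.01, drops to 0.001 after the plateau
```

In a real training loop you would call `sched.step(val_loss)` once per epoch after computing the validation loss.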