Using scikit-learn's scalers for torchvision

I noticed an improvement when doing per-channel normalization (my images have 6 channels). It would be nice to simply use scikit-learn’s scalers such as MinMaxScaler, but they are much slower. The code for doing it (inside __getitem__) is:

import torch
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
for i in range(img.size()[0]):
    img[i] = torch.tensor(scaler.fit_transform(img[i]))

I tried to code it myself using PyTorch. For the MinMaxScaler I wrote:

class MinMaxScaler(object):
    """
    Transforms each channel to the range [0, 1].
    """

    def __call__(self, tensor):
        for ch in tensor:
            scale = 1.0 / (ch.max() - ch.min())
            ch.mul_(scale).sub_(ch.min().mul_(scale))
        return tensor

The problem is that I do not obtain the same results as with the original scaler. Do you spot anything wrong in my code? Is per-channel scaling already implemented in torchvision?

Since you are working in-place on ch, you don’t need the second multiplication by scale in your custom implementation: after the mul_, ch.min() already returns the new (scaled) minimal value, which doesn’t need to be scaled again.

Also, you would need to take the max and min values along dim 0, as done in the sklearn implementation.
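
For example, with toy values (assumed here purely for illustration):

ch = torch.tensor([2.0, 4.0, 6.0])
scale = 1.0 / (ch.max() - ch.min())   # 0.25
ch.mul_(scale)                        # ch is now [0.5, 1.0, 1.5]
print(ch.min())                       # tensor(0.5000), i.e. already original_min * scale
# subtracting ch.min().mul_(scale) would subtract 0.125 instead of 0.5,
# leaving [0.375, 0.875, 1.375] instead of the expected [0.0, 0.5, 1.0]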

This implementation should work:

class PyTMinMaxScaler(object):
    """
    Transforms each channel to the range [0, 1].
    """
    def __call__(self, tensor):
        for ch in tensor:
            # column-wise min/max of the 2D channel, as in sklearn's MinMaxScaler
            scale = 1.0 / (ch.max(dim=0)[0] - ch.min(dim=0)[0])
            # ch.min() is recomputed after the in-place mul_, so it is already scaled
            ch.mul_(scale).sub_(ch.min(dim=0)[0])
        return tensor

However, the loop will slow down your code.
To get the most out of PyTorch, you should use vectorized code:

class PyTMinMaxScalerVectorized(object):
    """
    Transforms each channel to the range [0, 1].
    """
    def __call__(self, tensor):
        # per-channel, column-wise scale of shape (C, 1, W), broadcast over the rows
        scale = 1.0 / (tensor.max(dim=1, keepdim=True)[0] - tensor.min(dim=1, keepdim=True)[0])
        # the min is recomputed after the in-place mul_, so it is already scaled
        tensor.mul_(scale).sub_(tensor.min(dim=1, keepdim=True)[0])
        return tensor

Let’s check whether we get the same values:

img1 = torch.randn(6, 100, 100)
img2 = img1.clone()
img3 = img1.clone()

# sklearn
scaler = MinMaxScaler()
for i in range(img1.size()[0]):
    img1[i] = torch.tensor(scaler.fit_transform(img1[i]))

# PyTorch manual
scaler = PyTMinMaxScaler()
out2 = scaler(img2)

# PyTorch fast
scaler_fast = PyTMinMaxScalerVectorized()
out3 = scaler_fast(img3)

print((img1 - out2).abs().max())
> tensor(1.1921e-07)
print((img1 - out3).abs().max())
> tensor(1.1921e-07)
print((out2 == out3).all())
> tensor(True)

That looks good! The small differences are due to the limited floating point precision.

Let’s see how fast the PyTorch versions are on the CPU via %timeit (on my old laptop):

%timeit scaler(img2)
> 1.85 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit scaler_fast(img3)
> 529 µs ± 44.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

If you are using a modern CPU, your code should be way faster. :wink:

Thanks for the answer. Unfortunately, I’m having some issues with your solution.
I just changed GPUs, and using the same code but with your scaler implementation, I kept getting NaN for the training loss. I thought I had some issues with the new GPU, but after a day of experiments I finally realized that if I do not use your scaler, I get the right losses! Moreover, training is much faster if I do not scale the images (from roughly 19 minutes down to 12).

Do you only get the NaN values using one of my implementations, or also with your original one using sklearn?
If it’s just with mine, could you post an input tensor which creates these NaN values?
It’s a bit weird, as I’ve tested your approach against mine and got the same results (up to floating point precision).
So if the scaling creates NaNs in your training, both approaches should behave the same.

Are you getting the right losses using the sklearn approach now?

The speedup without scaling is expected, as you save some operations, or am I missing something?

Using my implementation (the MinMaxScaler class with a for loop over the channels) I get actual values (not NaN). With yours, I get NaN already after the first epoch. Everything else is exactly the same.
I’m working with images, so I wouldn’t know how to post an example: is a single 512x512 tensor enough?

This is most likely caused by a division by zero when the max and min values are equal.
I naively implemented sklearn’s approach without checks.
Add this check to your implementation:

class PyTMinMaxScalerVectorized(object):
    """
    Transforms each channel to the range [0, 1].
    """
    def __call__(self, tensor):
        dist = (tensor.max(dim=1, keepdim=True)[0] - tensor.min(dim=1, keepdim=True)[0])
        # guard against division by zero for constant columns (max == min)
        dist[dist == 0.] = 1.
        scale = 1.0 / dist
        tensor.mul_(scale).sub_(tensor.min(dim=1, keepdim=True)[0])
        return tensor
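
As a quick sanity check (toy values assumed), a constant channel now maps to 0 instead of producing NaN:

img = torch.randn(6, 100, 100)
img[0] = 3.0                                    # constant channel -> max == min
out = PyTMinMaxScalerVectorized()(img.clone())  # clone, since the scaler works in-place
print(torch.isnan(out).any())                   # tensor(False)
print(out[0].unique())                          # tensor([0.])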

It works perfectly, thank you very much!


For those who want the general case:

class MinMaxScalerVectorized(object):
    """MinMax Scaler

    Transforms each channel to the range [a, b].

    Parameters
    ----------
    feature_range : tuple
        Desired range (a, b) of the transformed data.
    """

    def __init__(self, feature_range=(0, 1)):
        self.feature_range = feature_range

    def __call__(self, tensor):
        """Scale features

        Parameters
        ----------
        tensor : tuple, list
            List of feature tensors to be stacked and scaled.

        Returns
        -------
        tensor
            A tensor with features scaled to the requested range.
        """

        tensor = torch.stack(tensor)

        # Feature range
        a, b = self.feature_range

        dist = tensor.max(dim=0, keepdim=True)[0] - tensor.min(dim=0, keepdim=True)[0]
        dist[dist == 0.0] = 1.0
        scale = 1.0 / dist
        tensor.mul_(scale).sub_(tensor.min(dim=0, keepdim=True)[0])
        tensor.mul_(b - a).add_(a)

        return tensor

Usage:

scaler = MinMaxScalerVectorized(feature_range=(-1, 1))
scaled_data = scaler(data)  # data is a list/tuple of feature tensors
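
As a quick check (the shapes and values below are assumptions), the output should end up in the requested range; note that the input has to be a list or tuple of tensors, since the scaler stacks them first:

data = [torch.randn(10) for _ in range(4)]
scaler = MinMaxScalerVectorized(feature_range=(-1, 1))
scaled_data = scaler(data)
print(scaled_data.min().item(), scaled_data.max().item())  # approximately -1.0 and 1.0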

Hi, I could not understand the order of multiplication and subtraction in

tensor.mul_(scale).sub_(tensor.min(dim=0, keepdim=True)[0])

My understanding of the standard min-max scaling is (x - xmin) * scale, where scale = 1 / (xmax - xmin). So could you please help me understand why the mul_() comes before the subtraction?
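
The two orderings agree because the minimum passed to sub_() is recomputed after the in-place multiplication, so it already equals xmin * scale, and x * scale - xmin * scale == (x - xmin) * scale. A toy check (values assumed for illustration):

x = torch.tensor([[2.0], [4.0], [6.0]])              # toy column of features
xmin = x.min(dim=0, keepdim=True)[0]
scale = 1.0 / (x.max(dim=0, keepdim=True)[0] - xmin)

standard = (x - xmin) * scale
inplace = x.clone().mul_(scale)
inplace.sub_(inplace.min(dim=0, keepdim=True)[0])    # min recomputed after scaling
print(torch.allclose(standard, inplace))             # True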