How to efficiently normalize a batch of tensors to [0, 1]

Hi,

I have a batch of tensors. How can I efficiently normalize it to the range [0, 1]?

For example,
The tensor A has shape [batch=25, height=3, width=3]. I can use a for-loop to perform this normalization, like

# batch-wise normalization to [0, 1] over the height and width dims
for i in range(batch):
    min_ele = torch.min(A[i])
    A[i] -= min_ele
    A[i] /= torch.max(A[i])

However, this solution is slow. Is there a more efficient way?

Thanks!

You could calculate the min and max values directly for all samples in the batch and apply the normalization:

A -= A.min(1, keepdim=True)[0]
A /= A.max(1, keepdim=True)[0]

Hi, @ptrblck

Thanks for your reply. However, I want to calculate the minimum and maximum elements over both the height and width dimensions.

For example, for a tensor a=[[1,2],[3,4]], the min/max elements should be 1 and 4:

a = torch.Tensor([[1, 2], [3, 4]])
torch.min(a)  # this returns 1
torch.max(a)  # this returns 4

I have tried your solution, but it gives a vector:

a = torch.Tensor([[1, 2], [3, 4]])
a.min(1, keepdim=True)[0]  # this gives tensor([[1.], [3.]])
a.max(1, keepdim=True)[0]  # this gives tensor([[2.], [4.]])

Based on your code snippet I assumed batch would correspond to the batch size and that my code snippet would yield the same result as yours.
Could you post the shape of A and an executable code snippet (using random values for A)?

@ptrblck Here it is

batch_size = 3
height = 2
width = 2

A = torch.randint(2, 11, (batch_size, height, width)).float()
AA = A.clone()
print(A)

# I can get what I want from below for-loop solution
for i in range(batch_size):
    A[i] -= torch.min(A[i])
    A[i] /= torch.max(A[i])

# Your solution
AA -= AA.min(1, keepdim=True)[0]
AA /= AA.max(1, keepdim=True)[0]

print(A)  # A and AA are different
print(AA)

Thanks for the code.
This should work:

AA = AA.view(AA.size(0), -1)
AA -= AA.min(1, keepdim=True)[0]
AA /= AA.max(1, keepdim=True)[0]
AA = AA.view(batch_size, height, width)
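
Alternatively, here is a small variant of the same idea, in case you would rather not reshape the tensor back and forth (just a sketch using broadcasting; it replaces the snippet above rather than following it):

flat = AA.view(AA.size(0), -1)            # flatten each sample: [B, H*W], no copy
mn = flat.min(dim=1)[0].view(-1, 1, 1)    # per-sample min, shape [B, 1, 1]
mx = flat.max(dim=1)[0].view(-1, 1, 1)    # per-sample max, shape [B, 1, 1]
AA = (AA - mn) / (mx - mn)                # broadcasts over height and width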

Got you. Thanks!

Is your solution much faster than the for-loop?

For your specified sizes, I get these numbers on a CPU:

# your method
402 µs ± 26.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# my suggestion
115 µs ± 7.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Hello @ptrblck !

Strange, but your approach with views is very slow.

It is faster than the loop approach when I use timeit, but my inference pipeline got about 10 times slower (around 50 FPS with the for-loop vs. about 5 FPS with views).

EDIT 1:
Just added torch.cuda.synchronize() and measured again:

  1. for loop: 0.5 ms
  2. view approach: 150 ms

I don’t understand what is happening; view shouldn’t change the tensor itself (from contiguous to non-contiguous).

Do you have any thoughts?

Additional info:
I use a CUDA tensor with shape [B, 3, 1024, 1024]
torch version: 1.2.0
cuda version: 10.0.130
GPU: NVIDIA QUADRO GV100
OS: linux

The view operation should be really cheap, as it only changes the meta-data, i.e. no copy will be performed and you would get an error if a copy were necessary.
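
For illustration, here is a quick sketch (just example tensors) showing that a view shares the underlying storage and raises instead of silently copying:

import torch

x = torch.randn(4, 8)
y = x.view(2, 16)
print(x.data_ptr() == y.data_ptr())  # True: same storage, only the metadata changed

# a non-contiguous tensor cannot be viewed; PyTorch raises instead of copying silently
try:
    x.t().view(-1)
except RuntimeError as e:
    print("view failed:", e)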

Could you post your profiling code so that I could take a look, please?

my profiling code is here:

import time
from contextlib import contextmanager

@contextmanager
def timeit(msg):
    start = time.time_ns()
    yield
    end = time.time_ns()

    result = round((end - start) * 1e-6, 2)
    msg = f"timeit: {result:<10} ms " + msg

    print(msg)

then I use it like so:

with timeit("normalization"):
    # code here
    torch.cuda.synchronize()

But the overall speed (on 1k images) is really different; it is noticeable even without any measurements.


The performance drop is in the min() function.
I read that it calls item(), and that is slow.

The min and max operations return a tensor in my example, so no synchronizing item() operation is performed.

I used this code snippet:


import time
import torch

batch_size = 3
height = 2
width = 2

A = torch.randint(2, 11, (batch_size, height, width)).float().cuda()
AA = A.clone()
print(A)

def fun1(A):
    for i in range(batch_size):
        A[i] -= torch.min(A[i])
        A[i] /= torch.max(A[i])

# my suggestion (view approach)
def fun2(AA):
    AA = AA.view(AA.size(0), -1)
    AA -= AA.min(1, keepdim=True)[0]
    AA /= AA.max(1, keepdim=True)[0]
    AA = AA.view(batch_size, height, width)


nb_iters = 100
torch.cuda.synchronize()
t0 = time.time()
for _ in range(nb_iters):
    fun1(A)
torch.cuda.synchronize()
t1 = time.time()
print((t1 - t0)/nb_iters)


torch.cuda.synchronize()
t0 = time.time()
for _ in range(nb_iters):
    fun2(AA)
torch.cuda.synchronize()
t1 = time.time()
print((t1 - t0)/nb_iters)

And got these results:

  • CPU: original: 9.6135e-05 s/iter, mine: 2.3348e-05 s/iter
  • GPU: original: 0.0004302668 s/iter, mine: 7.0183e-05 s/iter

Note that the original workload is really small, so using:

batch_size = 300
height = 200
width = 200

I get:

  • CPU: 0.0646844482421875 s/iter vs. 0.01887556552886963 s/iter
  • GPU: 0.040595355033874514 s/iter vs. 0.0023879456520080567 s/iter

Thank you for checking it out!
I just ran your code as is (without any modifications, so it uses a CUDA tensor):
on sizes 1:

batch_size = 300
height = 200 
width = 200 

got

0.028152191638946535
0.0026354169845581054

on sizes 2:

batch_size = 8 
height = 1024 
width = 1024 

got

0.0010659313201904297
0.05072437047958374

and on sizes 3:

batch_size = 2
height = 4096 
width = 4096 

got

0.0024710512161254883
0.7927418446540833

So it gets slower as h and w increase; isn’t that weird?

Update 1:
Same situation on torch version 1.5.0.

I think the overhead is created by the torch.max call with a dimension keyword, as it’ll also return the indices.
In the last use case the batch dimension can more or less be ignored, as it’s much smaller than the flattened height*width.
If I’m not mistaken, there was recently a feature request to add a return_indices argument to torch.max.
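
To make the indices overhead concrete, here is a small sketch (torch.amax is only available in newer releases, so treat that line as an assumption for your version):

import torch

A = torch.randn(8, 1024 * 1024)

values, indices = A.max(dim=1)  # dim-wise max also computes the argmax indices
global_max = torch.max(A)       # global max: values only, no indices

# on newer PyTorch releases, amax/amin return only the values for a given dim:
# per_sample_max = A.amax(dim=1)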

Got it.
I can confirm that the overhead is due to the dimension keyword: I just tried torch.min(AA) (on the viewed tensor) and didn’t notice any overhead.

Thank you for the help!


@ptrblck , say I have a DataLoader and want to normalize my whole dataset using the min-max scaling solution above. Would it be a good approach to do it on every batch (where the mins and maxes would differ)? Or what would be a better approach?

Usually you wouldn’t normalize with the batch statistics directly, since you would also need to do the same during your inference/deployment code.
Depending on your use case it could work (e.g. if your test use case is also using the same batch size) or it could change the behavior of your model a bit, since the min and max values depend on the actual noise (the effect might be small, so it could still work, but you would need to verify it).
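
If you want the normalization to be independent of the batching, one option is to min-max scale each sample on its own, e.g. inside the Dataset. A minimal sketch (the class and its fields are just placeholders, not your actual pipeline):

import torch
from torch.utils.data import Dataset

class MinMaxDataset(Dataset):
    def __init__(self, samples):
        self.samples = samples  # e.g. a list of [C, H, W] or [T] tensors

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        x = self.samples[idx].float()
        x = x - x.min()                   # shift so the per-sample min is 0
        x = x / x.max().clamp(min=1e-8)   # scale to [0, 1], guarding against constant samples
        return x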

I see, thanks for the reply.
How could your snippet be used for normalizing the samples in a dataloader then?
Say I have the following; is this the correct approach?

for i_batch, batch in enumerate(dataloader, start=1):
    mixed_audio_ba, voice_audio_ba, _ = batch

    mixed_audio_ba -= mixed_audio_ba.min(1, keepdim=True)[0]
    mixed_audio_ba /= mixed_audio_ba.max(1, keepdim=True)[0]

    voice_audio_ba -= voice_audio_ba.min(1, keepdim=True)[0]
    voice_audio_ba /= voice_audio_ba.max(1, keepdim=True)[0]

    ...

    model_out = model(mixed_audio_ba, voice_audio_ba)

The posted approach would also normalize the entire batch. As I previously mentioned, it could work, but note that this normalization now depends on the batch size and could thus change the behavior of the model if the batch size is changed, e.g. during inference. The change might be minimal and the model might still perform well, so you would need to check it in your use case.