How to efficiently normalize a batch of tensors to [0, 1]

Hi,

I have a batch of tensors. How can I efficiently normalize each of them to the range [0, 1]?

For example,
The tensor A has dimensions [batch=25, height=3, width=3]. I can do this normalization with a for-loop like this:

# normalize each sample in the batch to [0, 1] over height and width
for i in range(batch):
    min_ele = torch.min(A[i])
    A[i] -= min_ele
    A[i] /= torch.max(A[i])

However, this solution is slow. Is there a more efficient way?

Thanks!

You could calculate the min and max values directly for all samples in the batch and apply the normalization:

A -= A.min(1, keepdim=True)[0]
A /= A.max(1, keepdim=True)[0]

Hi, @ptrblck

Thanks for your reply. However, I want to calculate the minimum and maximum element over both the height and width dimensions.

For example, for a tensor a=[[1,2],[3,4]], the min and max elements should be 1 and 4.

a = torch.Tensor([[1,2],[3,4]])
torch.min(a)  # this function will return 1
torch.max(a) # return 4

I tried your solution, but it returns one value per row instead of a single value.

a = torch.Tensor([[1,2],[3,4]])
a.min(1, keepdim=True)[0] # this gives [[1], [3]]
a.max(1, keepdim=True)[0] # this gives [[2], [4]]
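
To make the mismatch concrete, here is a minimal sketch of what a reduction over dim=1 returns for the 2x2 example and for a 3D batch. The variable names (row_min, per_column_min, per_sample_min) are only illustrative and do not come from the thread.

```python
import torch

a = torch.tensor([[1., 2.], [3., 4.]])

# Reducing over dim=1 returns one value per row, not one value overall.
row_min = a.min(1, keepdim=True)[0]   # shape [2, 1]
row_max = a.max(1, keepdim=True)[0]   # shape [2, 1]

# For a batch shaped [batch, height, width], dim=1 reduces only the height axis.
A = torch.rand(25, 3, 3)
per_column_min = A.min(1, keepdim=True)[0]                # shape [25, 1, 3]

# Flattening height and width first gives one value per sample.
per_sample_min = A.view(25, -1).min(1, keepdim=True)[0]   # shape [25, 1]

print(row_min.tolist(), row_max.tolist())
print(per_column_min.shape, per_sample_min.shape)
```

So on a 3D batch, min(1) leaves one minimum per column of every sample, which is why it differs from calling torch.min on each A[i].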

Based on your code snippet I assumed batch would correspond to the batch size, and my code snippet would yield the same result as yours.
Could you post the shape of A and an executable code snippet (using random values for A)?

@ptrblck Here it is

batch_size = 3
height = 2
width = 2

A = torch.randint(2, 11, (batch_size, height, width)).float()
AA = A.clone()
print(A)

# The for-loop below gives the result I want
for i in range(batch_size):
    A[i] -= torch.min(A[i])
    A[i] /= torch.max(A[i])

# Your solution
AA -= AA.min(1, keepdim=True)[0]
AA /= AA.max(1, keepdim=True)[0]

print(A)  # A and AA are different
print(AA)

Thanks for the code.
This should work:

AA = AA.view(A.size(0), -1)
AA -= AA.min(1, keepdim=True)[0]
AA /= AA.max(1, keepdim=True)[0]
AA = AA.view(batch_size, height, width)
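
As a quick sanity check, the sketch below compares the flatten-then-reduce approach with the for-loop on random data. The function names normalize_loop and normalize_flat are hypothetical wrappers introduced here for illustration; both work on copies so the input is left untouched.

```python
import torch

def normalize_loop(x):
    # Reference: per-sample min-max normalization with a Python loop.
    out = x.clone()
    for i in range(out.size(0)):
        out[i] -= out[i].min()
        out[i] /= out[i].max()
    return out

def normalize_flat(x):
    # Flatten each sample, reduce along dim=1, then restore the shape.
    flat = x.reshape(x.size(0), -1).clone()
    flat -= flat.min(1, keepdim=True)[0]
    flat /= flat.max(1, keepdim=True)[0]
    return flat.view_as(x)

torch.manual_seed(0)
A = torch.rand(3, 2, 2) * 9 + 2   # float values in [2, 11)
same = torch.allclose(normalize_loop(A), normalize_flat(A))
print(same)
```

Note that both variants divide by zero if a sample is constant (min equals max), so real inputs may need a guard for that case.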

Got you. Thanks!

Is your solution much faster than the for-loop?

For your specified sizes, I get these numbers on a CPU:

# your method
402 µs ± 26.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# my suggestion
115 µs ± 7.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Hello @ptrblck !

Strange, but your approach with views is very slow for me.

It is faster than the loop approach when I measure it with timeit, but my inference pipeline got about 10 times slower (about 50 FPS with the for-loop, about 5 FPS with views).

EDIT 1:
Just added torch.cuda.synchronize()

  1. for loop: 0.5 ms
  2. view approach: 150 ms

I don’t understand what is happening. view shouldn’t change the tensor itself (e.g. from contiguous to non-contiguous).

Do you have any thoughts?

Additional info:
I use CUDA tensor with shape [B, 3, 1024, 1024]
torch version: 1.2.0
cuda version: 10.0.130
GPU: NVIDIA QUADRO GV100
OS: linux

The view operation should be really cheap, as it only changes the metadata, i.e. no copy is performed, and you would get an error if a copy were necessary.
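
A small sketch of that behavior, using a toy tensor that is not from the thread: a view shares storage with its source, and view() raises on a non-contiguous tensor instead of copying, while reshape() falls back to a copy.

```python
import torch

x = torch.arange(6.).reshape(2, 3)

# A view shares storage with the original tensor: same data pointer,
# and in-place changes through the view are visible in the original.
v = x.view(-1)
shares_memory = v.data_ptr() == x.data_ptr()
v[0] = 100.
changed = x[0, 0].item() == 100.

# Transposing makes the tensor non-contiguous; view() refuses to copy
# and raises, whereas reshape() falls back to a copy.
xt = x.t()
try:
    xt.view(-1)
    view_failed = False
except RuntimeError:
    view_failed = True

copied = xt.reshape(-1).data_ptr() != xt.data_ptr()
print(shares_memory, changed, view_failed, copied)
```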

Could you post your profiling code so that I could take a look, please?

My profiling code is here:

    import time
    from contextlib import contextmanager

    @contextmanager
    def timeit(msg):
        start = time.time_ns()
        yield
        end = time.time_ns()

        # elapsed time in milliseconds
        result = round((end - start) * 1e-6, 2)
        msg = f"timeit: {result:<10} ms " + msg

        print(msg)

Then I use it like so:

with timeit("normalization"):
    # code here
    torch.cuda.synchronize()
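
One possible refinement of the timing pattern above, as a sketch only: synchronize before starting the clock as well as before stopping it, so that GPU work queued earlier is not attributed to the timed block. The helper name cuda_timer and its results argument are hypothetical, and the snippet falls back to the CPU when CUDA is not available.

```python
import time
from contextlib import contextmanager

import torch

@contextmanager
def cuda_timer(msg, results=None):
    # Hypothetical helper: drain pending GPU work before starting the clock
    # and again before stopping it, so only the enclosed code is measured.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    try:
        yield
    finally:
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        elapsed_ms = (time.perf_counter() - start) * 1e3
        if results is not None:
            results[msg] = elapsed_ms
        print(f"timeit: {elapsed_ms:.2f} ms {msg}")

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.rand(8, 256, 256, device=device)
timings = {}
with cuda_timer("normalization", timings):
    # In-place normalization through a flattened view of x.
    flat = x.view(x.size(0), -1)
    flat -= flat.min(1, keepdim=True)[0]
    flat /= flat.max(1, keepdim=True)[0]
```

A single run like this still includes warm-up effects, so averaging over several iterations is usually more reliable.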

But the overall speed (on 1k images) is really different; it is noticeable even without any measurements.

The performance drop is in the min() function.
I read that it calls item(), which is slow.

The min and max operations return a tensor in my example, so no synchronizing item() operation is performed.

I used this code snippet:


import time
import torch

batch_size = 3
height = 2
width = 2

A = torch.randint(2, 11, (batch_size, height, width)).float().cuda()
AA = A.clone()
print(A)

def fun1(A):
    for i in range(batch_size):
        A[i] -= torch.min(A[i])
        A[i] /= torch.max(A[i])

# My suggestion (view-based)
def fun2(AA):
    AA = AA.view(AA.size(0), -1)
    AA -= AA.min(1, keepdim=True)[0]
    AA /= AA.max(1, keepdim=True)[0]
    AA = AA.view(batch_size, height, width)


nb_iters = 100
torch.cuda.synchronize()
t0 = time.time()
for _ in range(nb_iters):
    fun1(A)
torch.cuda.synchronize()
t1 = time.time()
print((t1 - t0)/nb_iters)


torch.cuda.synchronize()
t0 = time.time()
for _ in range(nb_iters):
    fun2(AA)
torch.cuda.synchronize()
t1 = time.time()
print((t1 - t0)/nb_iters)

And get these results:

  • CPU: original: 9.6135e-05 s/iter, mine: 2.3348e-05 s/iter
  • GPU: original: 0.0004302668 s/iter, mine: 7.0183e-05 s/iter

Note that the original workload is really small, so using:

batch_size = 300
height = 200
width = 200

I get:

  • CPU: 0.0646844482421875 s/iter vs. 0.01887556552886963 s/iter
  • GPU: 0.040595355033874514 s/iter vs. 0.0023879456520080567 s/iter

Thank you for checking it out!
I just ran your code as is (without any modifications, so it’s a CUDA tensor), and
on sizes 1:

batch_size = 300
height = 200 
width = 200 

got

0.028152191638946535
0.0026354169845581054

on sizes 2:

batch_size = 8 
height = 1024 
width = 1024 

got

0.0010659313201904297
0.05072437047958374

and on sizes 3:

batch_size = 2
height = 4096 
width = 4096 

got

0.0024710512161254883
0.7927418446540833

So it gets slower as h and w increase. Isn’t that weird?

Update 1:
Same situation on torch version 1.5.0.

I think the overhead is created by the torch.max call with a dimension keyword, as it’ll also return the indices.
In the last use case the batch dimension can more or less be ignored, as it’s much smaller than the flattened height*width.
If I’m not mistaken, there was recently a feature request to add a return_indices argument to torch.max.
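
For readers on newer PyTorch releases than the 1.2.0 and 1.5.0 versions discussed here: torch.amin and torch.amax reduce over one or more dimensions and return only the values, without indices. The sketch below assumes a release that provides them; normalize_amin_amax is a hypothetical name, and whether this avoids the slowdown reported above would need to be measured on the hardware in question.

```python
import torch

def normalize_amin_amax(x):
    # Reduce over every dimension except the batch dimension.
    dims = tuple(range(1, x.dim()))
    lo = torch.amin(x, dim=dims, keepdim=True)
    hi = torch.amax(x, dim=dims, keepdim=True)
    return (x - lo) / (hi - lo)

torch.manual_seed(0)
A = torch.rand(4, 8, 8) * 9 + 2
out = normalize_amin_amax(A)

# Each sample should now span exactly [0, 1].
per_sample_min = out.flatten(1).min(1)[0]
per_sample_max = out.flatten(1).max(1)[0]
print(per_sample_min, per_sample_max)
```

Unlike the in-place snippets in this thread, this version returns a new tensor and needs no view or reshape.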

Got it.
I can confirm that the overhead is due to the dimension keyword: I just tried torch.min(AA) (on the view tensor) and didn’t notice any overhead.

Thank you for the help!
