# How to efficiently normalize a batch of tensors to [0, 1]

Hi,

I have a batch of tensors. How can I efficiently normalize it to the range [0, 1]?

For example, the tensor is `A` with shape [batch=25, height=3, width=3]. I can use a for-loop to do this normalization, like

```
# batch-wise normalize to [0, 1] over the height and width dimensions
for i in range(batch):
    min_ele = torch.min(A[i])
    A[i] -= min_ele
    A[i] /= torch.max(A[i])
```

However, this solution is slow. Is there a more efficient way?

Thanks!

You could calculate the min and max values directly for all samples in the batch and apply the normalization:

```
A -= A.min(1, keepdim=True)[0]
A /= A.max(1, keepdim=True)[0]
```

Hi, @ptrblck

Thanks for your reply. However, I want to calculate the minimum and maximum elements over both the height and width dimensions.

For example, for a tensor `a = [[1, 2], [3, 4]]`, the min/max elements should be 1 and 4:

```
a = torch.Tensor([[1, 2], [3, 4]])
torch.min(a)  # this returns 1
torch.max(a)  # this returns 4
```

I have tried your solution, but it gives a vector.

```
a = torch.Tensor([[1, 2], [3, 4]])
a.min(1, keepdim=True)[0]  # this gives [[1.], [3.]]
a.max(1, keepdim=True)[0]  # this gives [[2.], [4.]]
```

Based on your code snippet I assumed `batch` would correspond to the batch size, and my code snippet would yield the same result as yours.
Could you post the shape of `A` and an executable code snippet (using random values for `A`)?

@ptrblck Here it is

```
import torch

batch_size = 3
height = 2
width = 2

A = torch.randint(2, 11, (batch_size, height, width)).float()
AA = A.clone()
print(A)

# I can get what I want from the for-loop solution below
for i in range(batch_size):
    A[i] -= torch.min(A[i])
    A[i] /= torch.max(A[i])

AA -= AA.min(1, keepdim=True)[0]
AA /= AA.max(1, keepdim=True)[0]

print(A)  # A and AA are different
print(AA)
```

Thanks for the code.
This should work:

```
AA = AA.view(A.size(0), -1)
AA -= AA.min(1, keepdim=True)[0]
AA /= AA.max(1, keepdim=True)[0]
AA = AA.view(batch_size, height, width)
```
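
For a quick sanity check, a minimal sketch like this (the sizes are just placeholders) shows that the flattened version matches the per-sample loop:

```
import torch

A = torch.randint(2, 11, (3, 2, 2)).float()
AA = A.clone()

# per-sample loop
for i in range(A.size(0)):
    A[i] -= torch.min(A[i])
    A[i] /= torch.max(A[i])

# flattened, batched version
AA = AA.view(AA.size(0), -1)
AA -= AA.min(1, keepdim=True)[0]
AA /= AA.max(1, keepdim=True)[0]
AA = AA.view_as(A)

print(torch.allclose(A, AA))  # True
```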

Got you. Thanks!

Is your solution much faster than the for-loop?

For your specified sizes, I get these numbers on a CPU:

```
# your method
402 µs ± 26.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# my suggestion
115 µs ± 7.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

Hello @ptrblck !

Strange, but your approach with views is very slow.

It is faster than the loop approach when I use `timeit`, but my inference pipeline got about 10 times slower (about 50 FPS with the for-loop vs. about 5 FPS with views).

EDIT 1:
Just added `torch.cuda.synchronize()`

1. for loop: 0.5 ms
2. view approach: 150 ms

I don't understand what happens; `view` shouldn't change the tensor itself (from contiguous to non-contiguous).

Do you have any thoughts?

I use a CUDA tensor with shape [B, 3, 1024, 1024]
torch version: 1.2.0
cuda version: 10.0.130
OS: linux

The `view` operation should be really cheap, as it only changes the meta-data, i.e. no copy will be performed, and you would get an error if a copy were necessary.
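
As a minimal sketch of that behaviour (the shapes below are just placeholders): the viewed tensor shares its storage with the original, and `view` raises an error instead of silently copying when a copy would be required:

```
import torch

x = torch.randn(8, 3, 32, 32)
y = x.view(x.size(0), -1)

# same underlying storage, only the meta-data (shape/stride) changes
print(x.data_ptr() == y.data_ptr())  # True

# view refuses to run if the data would have to be copied
xt = x.transpose(1, 2)  # non-contiguous
try:
    xt.view(xt.size(0), -1)
except RuntimeError as e:
    print("view failed:", e)  # .reshape() or .contiguous().view() would copy instead
```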

Could you post your profiling code so that I could take a look, please?

my profiling code is here:

```
from contextlib import contextmanager
import time

@contextmanager
def timeit(msg):
    start = time.time_ns()
    yield
    end = time.time_ns()

    result = round((end - start) * 1e-6, 2)
    msg = f"timeit: {result:<10} ms " + msg

    print(msg)
```

then I use it like so:

```
with timeit("normalization"):
    # code here
    torch.cuda.synchronize()
```
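
As a side note, this timer only synchronizes at the end; a variant that also synchronizes before starting the clock (so previously queued GPU work doesn't leak into the measurement) could look roughly like the sketch below. `timeit_cuda` is just a hypothetical name:

```
from contextlib import contextmanager
import time

import torch

@contextmanager
def timeit_cuda(msg):
    # flush previously queued GPU work before starting the clock
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time_ns()
    yield
    # wait for the timed kernels to actually finish
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    end = time.time_ns()
    print(f"timeit: {round((end - start) * 1e-6, 2):<10} ms " + msg)
```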

But the overall speed (on 1k images) is really different; it is noticeable even without any measurements.

The performance drop is in the `min()` function.
I read that it calls `item()`, and that is slow.

The `min` and `max` operations return a tensor in my example, so no synchronizing `item()` operation is performed.
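
To make that concrete, a small sketch (shapes are placeholders): `min(dim=...)` returns a namedtuple of `values` and `indices` tensors that stay on the device, while converting a result to a Python number via `.item()` is what forces a synchronization with the host:

```
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(8, 3 * 1024 * 1024, device=device)

# reduction with a dim argument: values and indices stay on the device, no host sync
values, indices = x.min(1, keepdim=True)
print(values.shape, indices.shape)  # torch.Size([8, 1]) torch.Size([8, 1])

# pulling a Python number out of a tensor waits for the GPU work to finish
v = x.min().item()
print(v)
```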

I used this code snippet:

```
import time

import torch

batch_size = 3
height = 2
width = 2

A = torch.randint(2, 11, (batch_size, height, width)).float().cuda()
AA = A.clone()
print(A)

def fun1(A):
    for i in range(batch_size):
        A[i] -= torch.min(A[i])
        A[i] /= torch.max(A[i])

def fun2(AA):
    AA = AA.view(AA.size(0), -1)
    AA -= AA.min(1, keepdim=True)[0]
    AA /= AA.max(1, keepdim=True)[0]
    AA = AA.view(batch_size, height, width)

nb_iters = 100
torch.cuda.synchronize()
t0 = time.time()
for _ in range(nb_iters):
    fun1(A)
torch.cuda.synchronize()
t1 = time.time()
print((t1 - t0) / nb_iters)

torch.cuda.synchronize()
t0 = time.time()
for _ in range(nb_iters):
    fun2(AA)
torch.cuda.synchronize()
t1 = time.time()
print((t1 - t0) / nb_iters)
```

And got these results:

• CPU: original: `9.6135e-05 s/iter`, mine: `2.3348e-05 s/iter`
• GPU: original: `0.0004302668 s/iter`, mine: `7.0183e-05 s/iter`

Note that the original workload is really small, so using:

```
batch_size = 300
height = 200
width = 200
```

I get:

• CPU: `0.0646844482421875 s/iter` vs. `0.01887556552886963 s/iter`
• GPU: `0.040595355033874514 s/iter` vs. `0.0023879456520080567 s/iter`

Thank you for checking it out!
Just ran your code as is (without any modifications, so it's a CUDA tensor) and:
on sizes 1:

```
batch_size = 300
height = 200
width = 200
```

got

```
0.028152191638946535   # for-loop
0.0026354169845581054  # view approach
```

on sizes 2:

```
batch_size = 8
height = 1024
width = 1024
```

got

```
0.0010659313201904297  # for-loop
0.05072437047958374    # view approach
```

and on sizes 3:

```
batch_size = 2
height = 4096
width = 4096
```

got

```
0.0024710512161254883  # for-loop
0.7927418446540833     # view approach
```

So it gets slower as `h` and `w` increase; isn't that weird?

Update 1:
same situation on torch version 1.5.0

I think the overhead is created by the `torch.max` call with a dimension keyword, as it'll also return the indices.
In the last use case the batch dimension can more or less be ignored, as it's much smaller than the flattened `height*width`.
If I'm not mistaken, there was recently a feature request to add a `return_indices` argument to `torch.max`.
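
If the index computation really is the bottleneck, one possible workaround (a sketch, assuming PyTorch >= 1.7 where `amin`/`amax` are available) is to use the value-only reductions, which accept multiple dims and skip the indices entirely:

```
# value-only reductions; no indices are computed
# (for a [batch, height, width] tensor; adjust dims for [B, C, H, W])
A = A - A.amin(dim=(1, 2), keepdim=True)
A = A / A.amax(dim=(1, 2), keepdim=True)
```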

Got it.
I'm confirming that the overhead is due to the dimension keyword: I just tried `torch.min(AA)` (on the viewed tensor) and didn't notice any overhead.

Thank you for the help!
