Based on your code snippet I assumed batch would correspond to the batch size, and my code snippet would yield the same result as yours.
Could you post the shape of A and an executable code snippet (using random values for A)?
import torch

batch_size = 3
height = 2
width = 2
A = torch.randint(2, 11, (batch_size, height, width)).float()
AA = A.clone()
print(A)
# I can get what I want from below for-loop solution
for i in range(batch_size):
    A[i] -= torch.min(A[i])
    A[i] /= torch.max(A[i])
# Your solution
AA -= AA.min(1, keepdim=True)[0]
AA /= AA.max(1, keepdim=True)[0]
print(A) # A and AA are different
print(AA)
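For reference, the two results differ because `min(1, ...)` on a 3D tensor only reduces over the height dimension, not over each whole sample. A sketch of a vectorized version that matches the loop (assuming `A` has shape `[batch_size, height, width]`) flattens the spatial dims first so the reduction runs over `height*width`:

```python
import torch

# Fixed example values so each sample has a distinct min and max
A = torch.tensor([[[2., 5.], [7., 10.]],
                  [[3., 3.], [6., 9.]],
                  [[4., 8.], [2., 6.]]])
AA = A.clone()

# Loop version: per-sample min-max normalization to [0, 1]
for i in range(A.size(0)):
    A[i] -= torch.min(A[i])
    A[i] /= torch.max(A[i])

# Vectorized version: flatten height*width into dim 1, reduce over it,
# and let in-place ops on the view write back into AA
flat = AA.view(AA.size(0), -1)
flat -= flat.min(1, keepdim=True)[0]
flat /= flat.max(1, keepdim=True)[0]

print(torch.allclose(A, AA))  # both normalize each sample identically
```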
Strange, but your approach with views is very slow.
It is faster than the loop approach when I measure it with timeit, but the inference pipeline got about 10 times slower (about 50 FPS with the for loop vs. about 5 FPS with views).
EDIT 1:
Just added torch.cuda.synchronize()
for loop: 0.5 ms
view approach: 150 ms
I don’t understand what happens; view shouldn’t change the tensor itself (from contiguous to non-contiguous).
Do you have any thoughts?
Additional info:
I use CUDA tensor with shape [B, 3, 1024, 1024]
torch version: 1.2.0
cuda version: 10.0.130
GPU: NVIDIA QUADRO GV100
OS: linux
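Since CUDA kernels run asynchronously, a minimal timing sketch (assumed setup, not the exact pipeline from above) would call `torch.cuda.synchronize()` before reading the clock on both sides of the timed region:

```python
import time
import torch

# Assumed input shape from the post above; falls back to CPU if no GPU
x = torch.rand(4, 3, 1024, 1024)
if torch.cuda.is_available():
    x = x.cuda()

def normalize_view(t):
    # Flatten spatial dims, normalize each sample to [0, 1]
    flat = t.view(t.size(0), -1)
    flat = flat - flat.min(1, keepdim=True)[0]
    flat = flat / flat.max(1, keepdim=True)[0]
    return flat.view_as(t)

# Synchronize so pending kernels don't leak into the measurement
if torch.cuda.is_available():
    torch.cuda.synchronize()
start = time.perf_counter()
out = normalize_view(x)
if torch.cuda.is_available():
    torch.cuda.synchronize()
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"view approach: {elapsed_ms:.2f} ms")
```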
The view operation should be really cheap, as it only changes the metadata, i.e. no copy will be performed and you would get an error if a copy were necessary.
Could you post your profiling code so that I could take a look, please?
I think the overhead is created by the torch.max call with a dimension keyword, as it’ll also return the indices.
In the last use case the batch dimension can more or less be ignored, as it’s much smaller than the flattened height*width.
If I’m not mistaken, there was recently a feature request to add a return_indices argument to torch.max.
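To illustrate the point about the overhead (with made-up values): calling `max` with a `dim` argument always returns both the values and the argmax indices, so the kernel has to track index positions even when only the values are needed.

```python
import torch

t = torch.tensor([[1., 5., 3.],
                  [4., 2., 6.]])

# Reducing with a dim argument yields a (values, indices) pair
values, indices = t.max(dim=1)
print(values)   # tensor([5., 6.])
print(indices)  # tensor([1, 2])
```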