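I have a tensor `z` of shape `(38, 38, 7, 7, 21) = (x_pos, y_pos, grid_i, grid_j, class_num)`, and I wish to normalize each entry by the L2 norm taken over the class dimension, i.e. according to the formula:

$$\hat{z}_{x,y,i,j,c} = \frac{z_{x,y,i,j,c}}{\sqrt{\sum_{c'=1}^{21} z_{x,y,i,j,c'}^{2}}}$$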

I have produced a working example of what I mean below. The problem is that it is extremely slow: approximately 2-3 seconds for each grid entry. There are 49 grid entries, so at 3 seconds each that is 49 × 3 = 147 seconds per tensor, which is far too long considering I need to do this for thousands of image feature maps.

Any suggestions for optimizations, or pointers to obvious problems, would be very much appreciated. This is part of a PyTorch convolutional neural network architecture, so I am using torch tensors and tensor ops.

```
import torch


def normalizeScoreMap(score_map):
    # Normalize in place: for each (x, y, grid_i, grid_j), divide the 21 class
    # scores by their L2 norm.
    for grid_i in range(7):
        for grid_j in range(7):
            for x in range(38):
                for y in range(38):
                    # Sum of squared class scores at this position
                    grid_sum = torch.tensor(0.0).cuda()
                    for class_num in range(21):
                        grid_sum += torch.pow(score_map[x][y][grid_i][grid_j][class_num], 2)
                    grid_normalizer = torch.sqrt(grid_sum)
                    for class_num in range(21):
                        score_map[x][y][grid_i][grid_j][class_num] /= grid_normalizer
    return score_map


random_score_map = torch.rand(38, 38, 7, 7, 21).cuda()
score_map = normalizeScoreMap(random_score_map)
```
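For comparison, the loops above divide each 21-element class vector by its own L2 norm, which can in principle be expressed as a single broadcasted tensor operation. The following is only a minimal sketch under a few assumptions: the function name `normalize_score_map_vectorized` is hypothetical, no class vector is all zeros (otherwise the division yields NaN or inf), and returning a new tensor rather than modifying the input in place is acceptable.

```python
import torch


def normalize_score_map_vectorized(score_map):
    # Hypothetical helper: L2 norm over the class dimension (last axis).
    # keepdim=True keeps a trailing axis of size 1, so the result has shape
    # (38, 38, 7, 7, 1) and broadcasts against score_map in the division.
    norms = torch.sqrt(torch.sum(score_map ** 2, dim=-1, keepdim=True))
    return score_map / norms


# Runs on the CPU here; call .cuda() on the tensor to run it on the GPU.
score_map = torch.rand(38, 38, 7, 7, 21)
normalized = normalize_score_map_vectorized(score_map)
```

After this, every class vector `normalized[x, y, grid_i, grid_j]` should have unit L2 norm. `torch.nn.functional.normalize(score_map, p=2, dim=-1)` appears to cover the same case and additionally clamps the norm with a small `eps` to avoid division by zero.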

Edit: For reference, I have an i9-9900K CPU and an NVIDIA 2080 GPU, so my hardware is quite good. I would be willing to try multi-threading, but I am looking for more obvious problems/optimizations first.