I need to compute the covariance of the input feature maps of a `Conv2d` layer. I'm using a `forward_pre_hook` to achieve this, as follows:

```python
def _forward_hook(self, module, input):
    if torch.is_grad_enabled():
        x = input[0].detach()  # detach instead of deprecated .data
        if module not in self.m_aa:
            # [C, C] accumulator on the same device as the activations
            self.m_aa[module] = torch.zeros(x.size(1), x.size(1), device=x.device)
        if isinstance(module, nn.Conv2d):
            # x: [N, C, H, W]
            for h in range(x.size(2)):  # Parallelizable?
                for w in range(x.size(3)):
                    self.m_aa[module] += x[:, :, h, w].T @ x[:, :, h, w]
```

This is extremely slow. How can I better utilize the compute by parallelizing over the `H` and `W` dimensions of the feature map?
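For reference, one way to collapse both loops (a sketch, assuming the goal is exactly the sum of `x[:, :, h, w].T @ x[:, :, h, w]` over all spatial positions): fold `N`, `H`, and `W` into a single "sample" dimension with `permute`/`reshape`, then do one `[C, N*H*W] @ [N*H*W, C]` matmul, which the backend parallelizes internally.

```python
import torch

def covariance_loop(x: torch.Tensor) -> torch.Tensor:
    # Original approach: accumulate [C, C] outer-product sums position by position.
    C = x.size(1)
    m = torch.zeros(C, C, device=x.device)
    for h in range(x.size(2)):
        for w in range(x.size(3)):
            m += x[:, :, h, w].T @ x[:, :, h, w]
    return m

def covariance_vectorized(x: torch.Tensor) -> torch.Tensor:
    # [N, C, H, W] -> [N, H, W, C] -> [N*H*W, C], then a single matmul.
    y = x.permute(0, 2, 3, 1).reshape(-1, x.size(1))
    return y.T @ y

# Sanity check on random data: both versions agree.
x = torch.randn(4, 3, 5, 5)
assert torch.allclose(covariance_loop(x), covariance_vectorized(x), atol=1e-5)
```

The same sum can also be written as `torch.einsum('nchw,ndhw->cd', x, x)`; the reshape-plus-matmul form tends to map most directly onto a single GEMM call.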