I wrote a function that runs MSELoss from torch.nn on several input pairs in parallel, to speed up multiple loss calculations. The code is shown below:

```
from torch.nn import MSELoss
import torch.multiprocessing as mp
from multiprocessing import Process, Lock
import torch

def paral_MSELoss(*args): # Parallelized MSELoss for multiple Inputs in form of (x1,x2,y1,y2,z1,z2)
    assert len(args) % 2 == 0, "Argument number of paral_MSELoss must be even"
    out = [None] * (len(args)//2)
    vals = list(args)
    mutex = Lock()
    loss = MSELoss()
    def append_mul(index, input_1, input_2):
        nonlocal out, loss
        temp = loss(input_1, input_2)
        with mutex:
            out[index] = temp
    for i in range(len(vals)):
        if vals[i].device.type == 'cpu':
            vals[i].cuda()
    processes = []
    for mp_loop in range(0,len(vals),2):
        p = mp.Process(target=append_mul, args=(mp_loop//2, vals[mp_loop], vals[mp_loop+1]))
        p.start()
        processes.append(p)
    for p in processes: p.join()
    return out
```

But when I run the program on an example set of tensors, it doesn't update the out variable and returns a list of None values instead of the losses.

The test code is below:

```
a = torch.tensor([[2.22,2.56],[3,4]],dtype=torch.float64)
b = torch.tensor([[1,2],[3,4]], dtype=torch.float64)
c = torch.tensor([[1,1.23],[3,4]],dtype=torch.float64)
d = torch.tensor([[1,5],[2,4.65]],dtype=torch.float64)
out = f.paral_MSELoss(a,b,c,d)
loss = MSELoss()
out2 = [loss(a,b), loss(c,d)]
print(out)
print(out2)
```

and it prints:

```
[None, None]
[tensor(0.4505, dtype=torch.float64), tensor(3.9089, dtype=torch.float64)]
```
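
To narrow this down, here is a stripped-down sketch of the same pattern without torch. It rests on a few assumptions: a platform where the "fork" start method is available, a plain-Python `mse` standing in for MSELoss, and helper names (`write_to_list`, `send_to_queue`, `losses_via_list`, `losses_via_queue`) that are made up purely for illustration. As far as I can tell, each child process assigns into its own copy of `out`, so the list in the parent is never touched, whereas a result sent back explicitly through a queue does arrive.

```python
import multiprocessing as mp


def mse(xs, ys):
    # Plain-Python mean squared error, standing in for torch.nn.MSELoss.
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def write_to_list(out, index, xs, ys):
    # Runs in the child: assigns into the child's own copy of `out`.
    out[index] = mse(xs, ys)


def send_to_queue(queue, index, xs, ys):
    # Runs in the child: sends the result back to the parent explicitly.
    queue.put((index, mse(xs, ys)))


def losses_via_list(pairs):
    ctx = mp.get_context("fork")  # assumption: "fork" is available on this platform
    out = [None] * len(pairs)
    procs = [ctx.Process(target=write_to_list, args=(out, i, xs, ys))
             for i, (xs, ys) in enumerate(pairs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return out  # the parent's list is still all None


def losses_via_queue(pairs):
    ctx = mp.get_context("fork")  # same assumption as above
    queue = ctx.Queue()
    procs = [ctx.Process(target=send_to_queue, args=(queue, i, xs, ys))
             for i, (xs, ys) in enumerate(pairs)]
    for p in procs:
        p.start()
    out = [None] * len(pairs)
    for _ in procs:
        # Read every result before joining the children.
        index, value = queue.get()
        out[index] = value
    for p in procs:
        p.join()
    return out


if __name__ == "__main__":
    # Same numbers as the tensors a, b, c, d above, flattened into lists.
    pairs = [([2.22, 2.56, 3, 4], [1, 2, 3, 4]),
             ([1, 1.23, 3, 4], [1, 5, 2, 4.65])]
    print(losses_via_list(pairs))
    print(losses_via_queue(pairs))
```

If that reading is right, the mutex in my function is not the issue, since the children never write into the parent's list in the first place. What I am still unsure about is the recommended way to do this with torch tensors, especially ones on the GPU.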

Thanks in advance!