Nonlocal variable is not being updated in a multiprocessed nested MSELoss function

I have written a function to parallelize the MSELoss function from torch.nn to speed things up for multiple loss calculations. The code is as shown below:

from torch.nn import MSELoss
import torch.multiprocessing as mp
from multiprocessing import Process, Lock
import torch

def paral_MSELoss(*args):          # Parallelized MSELoss for multiple Inputs in form of (x1,x2,y1,y2,z1,z2)
    assert len(args) % 2 == 0, "Argument number of paral_MSELoss must be even"
    out = [None] * (len(args)//2)
    vals = list(args)
    mutex = Lock()
    loss = MSELoss()
    def append_mul(index, input_1, input_2):
        nonlocal out, loss
        temp = loss(input_1, input_2)
        with mutex:
            out[index] = temp
    for i in range(len(vals)):
        if vals[i].device.type == 'cpu':
            vals[i].share_memory_()    # move CPU tensors to shared memory before handing them to child processes
    processes = []
    for mp_loop in range(0,len(vals),2):
        p = mp.Process(target=append_mul, args=(mp_loop//2, vals[mp_loop], vals[mp_loop+1]))
        p.start()
        processes.append(p)
    for p in processes: p.join()
    return out

But when I run the program on an example set of tensors, it doesn’t update the out variable and returns a list of None values instead.

Test code is as below:

a = torch.tensor([[2.22,2.56],[3,4]],dtype=torch.float64)
b = torch.tensor([[1,2],[3,4]], dtype=torch.float64)
c = torch.tensor([[1,1.23],[3,4]],dtype=torch.float64)
d = torch.tensor([[1,5],[2,4.65]],dtype=torch.float64)

out = f.paral_MSELoss(a,b,c,d)

loss = MSELoss()
out2 = [loss(a,b), loss(c,d)]


and it returns:

[None, None]
[tensor(0.4505, dtype=torch.float64), tensor(3.9089, dtype=torch.float64)]

Thanks in advance!


Just as a first point, MSELoss already uses multithreading to speed up computation on multi-core machines. Why is this necessary for you?

I guess the problem is that, in a multiprocess environment, each child process works on its own copy of out, not on the object in your main process. So you only modify the copy local to the child process, and the list in your main process never changes.
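To illustrate the point about separate copies, here is a minimal sketch using only the standard library. It assumes a Unix-like system where the "fork" start method is available, and the names run_demo and worker are hypothetical, as is the squared difference used as a stand-in for the loss call. The child writes into out, but the parent only sees what is sent back explicitly through a queue:

```python
import multiprocessing as mp


def run_demo():
    # Assumption: "fork" is only available on Unix-like systems.
    ctx = mp.get_context("fork")
    out = [None, None]      # lives in the parent's memory
    queue = ctx.Queue()     # explicit channel back to the parent

    def worker(index, x, y):
        value = (x - y) ** 2        # stand-in for loss(input_1, input_2)
        out[index] = value          # changes the child's private copy only
        queue.put((index, value))   # this is what the parent can actually read

    procs = [ctx.Process(target=worker, args=(i, float(i + 3), 1.0))
             for i in range(2)]
    for p in procs:
        p.start()
    # Read the results before joining so no child blocks on a full pipe.
    received = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()
    return out, [received[i] for i in range(len(procs))]


if __name__ == "__main__":
    untouched, collected = run_demo()
    print(untouched)   # the parent's list is still [None, None]
    print(collected)   # values that came back through the queue
```

Neither nonlocal nor global changes this, since both only affect name lookup inside a single process. Note that forking is generally not an option once CUDA has been initialized, so this is meant as an illustration of the memory behaviour rather than a recipe for the GPU case.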

Hi. Thanks for your reply!

I want to create a model that produces more than 10 output matrices, and I want to speed up the whole process by running multiple MSELoss functions at once on the GPU instead of calling them in a for loop. With the normal usage, my GPU load is too low, which is just a waste of time.

With nonlocal out, the child processes are supposed to use the out of paral_MSELoss() itself. I also tried the more powerful global out (which is actually wrong in my opinion), and that doesn’t work either.

If your GPU usage is too low, that means either that you are giving it tasks that are too small or that something else in your code is the slow part.
I would check which one it is and fix it.
Multiprocessing will only create smaller tasks and add overhead, so it most likely won’t solve your problem…
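If the tasks turn out to be too small, one possible direction is to make them bigger instead of splitting them further. The sketch below assumes that all input/target pairs have the same shape, so they can be stacked and reduced in a single call; batched_mse is a hypothetical helper name, not something provided by torch:

```python
import torch


def batched_mse(inputs, targets):
    # Assumption: every tensor has the same shape, so stacking works.
    x = torch.stack(inputs)    # shape: (num_pairs, ...)
    y = torch.stack(targets)
    # Averaging over everything except the first dimension
    # gives one MSE value per pair.
    return ((x - y) ** 2).flatten(start_dim=1).mean(dim=1)


a = torch.tensor([[2.22, 2.56], [3, 4]], dtype=torch.float64)
b = torch.tensor([[1, 2], [3, 4]], dtype=torch.float64)
c = torch.tensor([[1, 1.23], [3, 4]], dtype=torch.float64)
d = torch.tensor([[1, 5], [2, 4.65]], dtype=torch.float64)

losses = batched_mse([a, c], [b, d])
print(losses)  # approximately 0.4505 and 3.9089, as in the loop version
```

This runs as one larger operation on whatever device the tensors live on. Whether it actually helps depends on where the time is going, so it is worth measuring before and after.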

It seems it doesn’t, now that I have added timing calls. Thanks for your time!