Parallelize inputs on CPU

I have a learning algorithm that tries to reconstruct the input data using a particular model.
My forward problem is to reconstruct the 10 input samples and compute the sum of the losses between the reconstructions and the inputs.
I would like the 10 reconstructions to run in parallel on the CPU, using 10 processes.
I don’t know whether this is even possible, that is, whether autograd can be used across different processes.

I have tried many approaches, mainly with torch.multiprocessing.spawn, and each one runs into a different problem.

What is the best way to do this?
Here is the single-process code:

import numpy as np
import torch

def forward_routine(leaf):
    # xi uses the leaf variable, so whatever it returns requires grad.
    xi = lambda x: apply_kernel(leaf, x, extra_param)
    err = np.zeros([S, L])  # err is a numpy array
    loss = 0.0              # accumulator for the summed loss
    for j in range(S):      # S = 10
        # The function 'reconstruct' uses xi many times.
        # rec is the reconstruction, which requires grad.
        # P is a parameter that is constant for each reconstruction.
        rec, err[j] = reconstruct(P, w[j], xi)
        loss_cur = loss_func(rec, obs[j])
        loss = loss + loss_cur

    return loss

if __name__ == "__main__":

    # x0 is a numpy array, let's say of size (D,N)
    # I pass this function to SciPy's L-BFGS routine.
    def torch_func(x0):
        # torch.from_numpy takes no requires_grad argument, so set it afterwards.
        leaf = torch.from_numpy(x0).requires_grad_(True)
        loss = forward_routine(leaf)
        loss.backward()  # populates leaf.grad
        grad = leaf.grad
        return loss.item(), grad.numpy()
    # Read input data
    # Run LBFGS, with torch_func as the function to minimize.

I know that lambda functions can’t be pickled, so I wrote a function object for xi, but I left it out here for brevity.
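To show the idea, here is a minimal sketch of what such a function object can look like. It is not the actual class: the name Xi is made up for the example, apply_kernel is a hypothetical stand-in, and plain floats take the place of the leaf tensor.

```python
import pickle


def apply_kernel(leaf, x, extra_param):
    # Hypothetical stand-in for the real kernel: a simple scaled product.
    return extra_param * leaf * x


class Xi:
    """Picklable replacement for `lambda x: apply_kernel(leaf, x, extra_param)`."""

    def __init__(self, leaf, extra_param):
        self.leaf = leaf
        self.extra_param = extra_param

    def __call__(self, x):
        return apply_kernel(self.leaf, x, self.extra_param)


xi = Xi(leaf=2.0, extra_param=0.5)

# An instance of a module-level class survives a pickle round trip,
# which is what multiprocessing relies on to send it to a worker.
xi_copy = pickle.loads(pickle.dumps(xi))
same_result = xi_copy(3.0) == xi(3.0)

# The equivalent lambda cannot be pickled by the standard pickle module.
try:
    pickle.dumps(lambda x: apply_kernel(2.0, x, 0.5))
    lambda_pickles = True
except (pickle.PicklingError, AttributeError):
    lambda_pickles = False
```

The class has to be defined at module level, because pickle stores only a reference to the class plus the instance attributes.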