Parallelize a for loop (single CPU, multiple cores)

Hi,

I would like to parallelize a for loop inside my model while training on a single CPU with many cores.
To be more precise, at some point in the forward pass I have something that looks like this:

def forward(self, x):
    # ...
    ys = []
    # these submodel calls are independent of each other,
    # so I would like to run them in parallel
    for latent in latents:
        ys.append(self.submodel(latent))
    # do something useful with ys
    # ...

I tried to solve this with Pool from torch.multiprocessing, but got the following error:

grad_fn=<AddmmBackward0>)'. Reason: 'RuntimeError('Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries. If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).')'
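
For reference, here is a minimal, self-contained sketch of what that attempt looked like (encoder and submodel below are stand-ins for my real modules, not the actual code):

import torch
import torch.nn as nn
import torch.multiprocessing as mp

encoder = nn.Linear(8, 16)   # stand-in for whatever produces the latents
submodel = nn.Linear(16, 4)  # stand-in for self.submodel

def run_submodel(latent):
    return submodel(latent)

if __name__ == "__main__":
    x = torch.randn(10, 8)
    # non-leaf tensors that require grad, since they come out of encoder
    latents = [encoder(x[i:i+1]) for i in range(x.shape[0])]

    with mp.Pool(processes=4) as pool:
        # this is where it fails: the latents (and results) would have to be
        # serialized to cross the process boundary, which autograd refuses
        # for non-leaf tensors that require grad
        ys = pool.map(run_submodel, latents)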

I would appreciate any help.
Thanks.