Best practice for a slow custom nonlinearity

Hi guys. I have a custom nonlinearity f(x) that runs very slowly, and at each training/inference step I have to apply f to a list of tensors (x_1, …, x_n). The good news is that x_1, …, x_n are disjoint, so there are no tensor-sharing complications, and I think multiprocessing is the right approach here. From the "Multiprocessing best practices" page of the PyTorch 1.8.1 documentation, it seems that a new process is created whenever the process_worker is called, and I don't think that suits my situation, where f has to run at every step. Any suggestions for the best way to use multiprocessing here? Thanks.
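To make the question concrete, here is a minimal sketch of the kind of pattern I have in mind: create a pool of worker processes once and reuse it at every step, rather than starting new processes each time. It is only an illustration under assumptions. `slow_f` and `apply_to_all` are hypothetical names, plain Python lists stand in for tensors, and it uses the standard-library `multiprocessing` module (the PyTorch docs describe `torch.multiprocessing` as a drop-in replacement for it). I have not checked whether this is actually the recommended approach for tensors.

```python
import multiprocessing as mp


def slow_f(x):
    # Hypothetical stand-in for the slow custom nonlinearity f(x).
    # A plain list of floats stands in for a tensor here (assumption).
    return [v * v if v > 0 else 0.0 for v in x]


def apply_to_all(pool, f, xs):
    # Run f on every element of xs using an already-created pool.
    # pool.map preserves input order, so result i corresponds to xs[i].
    return pool.map(f, xs)


def main():
    xs = [[-1.0, 2.0], [3.0, -4.0], [0.5, 0.0]]
    # Create the worker processes once, outside the step loop.
    with mp.Pool(processes=2) as pool:
        for step in range(3):
            # Each step reuses the same pool instead of creating a new one.
            ys = apply_to_all(pool, slow_f, xs)
        print(ys)


if __name__ == "__main__":
    main()
```

The point of the sketch is only that the pool lives outside the loop, so the per-step cost is sending inputs to the workers and collecting results, not process creation.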