Multiprocess a function locally with RPC?

I noticed that a post-processing step in my model takes more than twice as long as the actual NN part. This post-processing involves sorting the outputs of each sample individually (sample by sample, in a batch of 32+).
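
To make it concrete, the current post-processing looks roughly like this (simplified; `sort_one_sample`, the shapes, and the batch size are placeholders, and `torch.sort` just stands in for the real sort):

```python
import torch

def sort_one_sample(sample: torch.Tensor) -> torch.Tensor:
    # Stand-in for the real per-sample sort.
    return torch.sort(sample).values

def postprocess(outputs: torch.Tensor) -> torch.Tensor:
    # outputs: (batch_size, n) -- each sample is sorted on its own,
    # and this sequential loop is what dominates the runtime.
    return torch.stack([sort_one_sample(sample) for sample in outputs])
```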

The sorting comes from the “torchsort” package, which is already optimized as C++ TorchScript.
I therefore tried to multiprocess the per-sample sorts within a batch using torch.multiprocessing, but of course ran into: [...] autograd does not support crossing process boundaries.
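
Roughly what I tried (again simplified; the pool size is arbitrary and `torch.sort` stands in for the real torchsort call):

```python
import torch
import torch.multiprocessing as mp

def sort_one_sample(sample: torch.Tensor) -> torch.Tensor:
    return torch.sort(sample).values  # stand-in for the real sort

def postprocess_parallel(outputs: torch.Tensor) -> torch.Tensor:
    # Spread the per-sample sorts over worker processes.
    # This is where it fails: as soon as `outputs` requires grad, the
    # samples cannot be serialized across the process boundary, since
    # autograd does not support crossing process boundaries.
    with mp.Pool(processes=4) as pool:
        sorted_samples = pool.map(sort_one_sample, list(outputs))
    return torch.stack(sorted_samples)
```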

So I wonder: can I somehow “fake” distributed training locally with RPC (torch.distributed.rpc)?

Would that be a good choice to solve my speed problems?

Could it be done easily?
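
For concreteness, here is roughly what I have in mind, based on the torch.distributed.rpc and torch.distributed.autograd docs (an untested minimal sketch; the worker names, world size, port, and shapes are all made up, and `torch.sort` again stands in for the real sort):

```python
import os
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

WORLD_SIZE = 3  # one "trainer" plus two sort workers, all on this machine

def sort_one_sample(sample: torch.Tensor) -> torch.Tensor:
    return torch.sort(sample).values  # stand-in for the real sort

def run(rank: int):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=WORLD_SIZE)

    if rank == 0:  # the "trainer" process; the others just serve RPCs
        outputs = torch.randn(8, 16, requires_grad=True)  # fake NN outputs

        with dist_autograd.context() as context_id:
            # Fan the per-sample sorts out to the other local processes.
            futures = [
                rpc.rpc_async(f"worker{1 + i % (WORLD_SIZE - 1)}",
                              sort_one_sample, args=(sample,))
                for i, sample in enumerate(outputs)
            ]
            sorted_batch = torch.stack([fut.wait() for fut in futures])
            loss = sorted_batch.sum()

            # Distributed autograd instead of loss.backward(): gradients
            # are tracked across the RPC boundary and collected per context.
            dist_autograd.backward(context_id, [loss])
            grads = dist_autograd.get_gradients(context_id)
            print(grads[outputs].shape)  # gradient w.r.t. the NN outputs

    rpc.shutdown()  # blocks until all outstanding RPCs have finished

if __name__ == "__main__":
    mp.spawn(run, nprocs=WORLD_SIZE, join=True)
```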