I noticed that a post-processing step in my model takes more than twice as long as the actual NN forward pass. This post-processing sorts the outputs of each sample individually (sample by sample, in batches of 32+).
The sorting is done with the "torchsort" package, which is already optimized as a C++ TorchScript extension.
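For context, the per-sample post-processing currently looks roughly like this (just a sketch; the tensor shapes and the regularization value are placeholders, not my real settings):

```python
import torch
import torchsort

def postprocess(batch_out: torch.Tensor) -> torch.Tensor:
    # batch_out: (batch_size, n) model outputs that still require grad
    sorted_rows = [
        torchsort.soft_sort(row.unsqueeze(0), regularization_strength=0.1)
        for row in batch_out  # one sort per sample, currently sequential
    ]
    return torch.cat(sorted_rows, dim=0)
```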
I thus tried to parallelize the per-sample sorting across processes using
torch.multiprocessing, but of course I get the error:
[...] autograd does not support crossing process boundaries.
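My attempt looked roughly like this (again a sketch, not my exact code); sending rows that still require grad into the worker processes is what triggers that message:

```python
import torch
import torch.multiprocessing as mp
import torchsort

def sort_one(row: torch.Tensor) -> torch.Tensor:
    return torchsort.soft_sort(row.unsqueeze(0), regularization_strength=0.1)

def postprocess_parallel(batch_out: torch.Tensor) -> torch.Tensor:
    # Pickling non-leaf tensors that require grad across the process
    # boundary raises the "autograd does not support crossing process
    # boundaries" error.
    with mp.Pool(processes=4) as pool:
        sorted_rows = pool.map(sort_one, list(batch_out))
    return torch.cat(sorted_rows, dim=0)
```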
So I wonder: could I somehow "fake" distributed training locally with torch.distributed.rpc?
Would that be a good choice to solve my speed problems?
Could it be done easily?
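To make the question concrete, here is a rough sketch of what I imagine, running torch.distributed.rpc entirely on one machine and relying on distributed autograd to stitch the graph back together. Worker names, the port, the number of processes, and the placeholder loss are all made up for illustration:

```python
import os
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp
import torchsort

def sort_one(row: torch.Tensor) -> torch.Tensor:
    return torchsort.soft_sort(row.unsqueeze(0), regularization_strength=0.1)

def run(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        # stand-in for the NN output; my real model produces this tensor
        batch_out = torch.randn(32, 128, requires_grad=True)
        with dist_autograd.context() as context_id:
            # fan the per-sample sorts out to the other local workers
            futures = [
                rpc.rpc_async(f"worker{1 + i % (world_size - 1)}", sort_one, args=(row,))
                for i, row in enumerate(batch_out)
            ]
            sorted_batch = torch.cat([f.wait() for f in futures], dim=0)
            loss = sorted_batch.sum()  # placeholder loss
            # distributed autograd runs backward across the workers;
            # gradients live in the context, not in .grad
            dist_autograd.backward(context_id, [loss])
            grads = dist_autograd.get_gradients(context_id)
    rpc.shutdown()

if __name__ == "__main__":
    world_size = 4
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```

Is something along these lines workable, or would the RPC/serialization overhead per sample eat up whatever I gain from parallelizing the sorts?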