CPU Parallelism for Cumulative Loops

This use case doesn’t really need distributed training. For task-level parallelism on CPU you can try torch.jit.fork, which is the main mechanism PyTorch provides for running independent pieces of work asynchronously: torch.jit.fork — PyTorch 1.10.0 documentation.
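Here is a minimal sketch of how that could look for a chunked cumulative sum. The names (`chunk_cumsum`, `parallel_cumsum`), the cumsum workload, and the chunk count are just placeholders for your actual loop body; the point is the fork/wait pattern. If I recall correctly, fork only schedules tasks asynchronously when called from TorchScript, so the driver function is wrapped in `@torch.jit.script` (check the linked docs to confirm for your version):

```python
import torch
from typing import List


def chunk_cumsum(x: torch.Tensor) -> torch.Tensor:
    # Per-chunk cumulative work; stand-in for the body of your loop.
    return torch.cumsum(x, dim=0)


@torch.jit.script
def parallel_cumsum(chunks: List[torch.Tensor]) -> torch.Tensor:
    # Launch one asynchronous task per chunk: fork returns a Future
    # immediately, and wait blocks until that task has finished.
    futures = [torch.jit.fork(chunk_cumsum, c) for c in chunks]
    results = [torch.jit.wait(f) for f in futures]

    # Stitch the chunks back into one global cumulative sum by adding the
    # running total of all preceding chunks to each piece.
    offset = torch.zeros_like(results[0][:1])
    outputs: List[torch.Tensor] = []
    for r in results:
        outputs.append(r + offset)
        offset = offset + r[-1]
    return torch.cat(outputs)


if __name__ == "__main__":
    # float64 keeps the chunked and sequential results numerically comparable.
    data = torch.randn(1_000_000, dtype=torch.float64)
    pieces = list(torch.chunk(data, 8))  # 8 chunks is an arbitrary choice
    out = parallel_cumsum(pieces)
    assert torch.allclose(out, torch.cumsum(data, dim=0))
```

The forked tasks run on PyTorch’s inter-op thread pool, which you can size with `torch.set_num_interop_threads()` (call it before the first forked task). One caveat: this only helps if the cumulative loop can be split into chunks whose results are cheap to re-combine, like the offsets above; if every step truly depends on the full previous result, the loop stays inherently sequential.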