Microsoft's generic parallel_for has a concept of a partitioner, via its _Partitioner template parameter (see the Microsoft parallel_for docs).
However, I do not see any provision for a partitioner in PyTorch's at::parallel_for (Parallel.h). Is there any way the range can be partitioned unevenly across the threads?
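For reference, the signature in ATen/Parallel.h is at::parallel_for(begin, end, grain_size, f), where f is called with a sub-range [begin, end). One workaround I have been considering (just a sketch, not a PyTorch API; uneven_parallel_process and offsets are names I made up) is to precompute uneven chunk boundaries myself and then parallelize over chunk indices with a grain size of 1, mapping each index back to its real sub-range inside the lambda:

```cpp
#include <ATen/Parallel.h>
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: process [0, total) in chunks whose sizes we choose
// ourselves (here, geometrically shrinking), instead of the roughly even
// split at::parallel_for would produce on its own.
void uneven_parallel_process(int64_t total) {
  // Precompute uneven chunk boundaries: offsets[c] .. offsets[c+1] is chunk c.
  std::vector<int64_t> offsets{0};
  int64_t size = total / 2;
  while (offsets.back() < total) {
    offsets.push_back(
        std::min<int64_t>(offsets.back() + std::max<int64_t>(size, 1), total));
    size /= 2;
  }
  const int64_t num_chunks = static_cast<int64_t>(offsets.size()) - 1;

  // Parallelize over chunk indices with grain size 1, so each uneven chunk
  // can become its own task; map the chunk index back to the real range.
  at::parallel_for(0, num_chunks, /*grain_size=*/1,
      [&](int64_t chunk_begin, int64_t chunk_end) {
        for (int64_t c = chunk_begin; c < chunk_end; ++c) {
          for (int64_t i = offsets[c]; i < offsets[c + 1]; ++i) {
            // ... per-element work on index i ...
          }
        }
      });
}
```

With a grain size of 1, the uneven chunk sizes translate directly into uneven per-thread work, somewhat like what a custom partitioner would give. Is this a sensible substitute, or is there a supported way to do this?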