PyTorch Dynamic Parallelism Effectiveness with Small Processes

Hi, I have an application where I run a set of models (~10) on variable-batch-sized input. Since each of these models can run independently, I'm using dynamic parallelism, spawning one process per model. Each model's batch is around 500 items. But after getting it running, I actually see a significant slowdown compared to running the models sequentially (just iterating over them). Is this expected, i.e. does the per-process overhead outweigh the actual execution time at this small data size?
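For reference, here is a minimal stand-in for my setup, using plain Python `multiprocessing` and a dummy workload in place of the actual models (the real code uses PyTorch models; `run_model`, the batch sizes, and the pool layout here are just illustrative):

```python
import time
from multiprocessing import Pool

def run_model(batch):
    # Dummy stand-in for one model's forward pass over ~500 items.
    return sum(x * x for x in batch)

def run_sequential(batches):
    # Baseline: iterate over the "models" one at a time.
    return [run_model(b) for b in batches]

def run_parallel(batches):
    # One worker process per model; process startup and pickling of
    # inputs/outputs is paid on every call.
    with Pool(processes=len(batches)) as pool:
        return pool.map(run_model, batches)

if __name__ == "__main__":
    batches = [list(range(500)) for _ in range(10)]

    t0 = time.perf_counter()
    seq = run_sequential(batches)
    t1 = time.perf_counter()
    par = run_parallel(batches)
    t2 = time.perf_counter()

    assert seq == par
    print(f"sequential: {t1 - t0:.4f}s, parallel: {t2 - t1:.4f}s")
```

With batches this small, the parallel version is consistently slower for me, which is what prompted the question.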