Parallelizing model heads?

I have a model with a large number of simple heads (> 100). Currently, at the final stage of the forward call, I have to iterate through all the heads and append their results to a list, which is quite slow and underutilizes my hardware. Is there a way to parallelize running the heads? I think TensorFlow does this automatically, since everything is built into the graph, but I can't seem to find any equivalent functionality in PyTorch.
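For concreteness, here is a minimal sketch of the pattern I mean. The model and names (`ManyHeadModel`, `trunk`, `heads`) are just illustrative, assuming each head is a small `nn.Linear` on top of a shared trunk; the Python loop over `self.heads` is the part I'd like to parallelize:

```python
import torch
import torch.nn as nn

class ManyHeadModel(nn.Module):
    """Illustrative model: a shared trunk followed by many small heads."""

    def __init__(self, in_dim=32, hidden_dim=64, num_heads=128, head_dim=1):
        super().__init__()
        self.trunk = nn.Linear(in_dim, hidden_dim)
        # Over 100 independent small heads, stored in a ModuleList.
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, head_dim) for _ in range(num_heads)
        )

    def forward(self, x):
        h = self.trunk(x)
        # The slow part: a sequential Python loop over every head,
        # collecting each head's output into a list.
        outs = [head(h) for head in self.heads]
        return torch.cat(outs, dim=-1)

model = ManyHeadModel()
x = torch.randn(4, 32)
y = model(x)
print(y.shape)  # torch.Size([4, 128])
```

Each head launches its own tiny kernel one after another, so the GPU spends most of its time idle between launches.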