Speed up ensemble model by parallelizing the blocks

I am trying to speed up an algorithm that produces 3D skeleton joints from 2D images. The algorithm (GAST-NET) consists of 4 main blocks that run sequentially for every frame, and I'm trying to parallelize those 4 blocks. I have some questions about the process.

  1. Will parallelization help speed up the algorithm? I am trying to parallelize on one GPU only.
  2. What other approaches could help speed up the algorithm?
  3. A slightly related question: isn't PyTorch already trying to use the maximum GPU resources to produce output as fast as possible? I monitored the GPU utilization and it was between 38 and 50%. Is there a way to ensure that the GPU is used to the fullest?

This really depends on the algorithm.

If you could share some sample PyTorch code for the algorithm, I can look into it to see if there are some optimization opportunities.
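
In the meantime, it may help to measure how long each of the 4 blocks takes per frame, since that shows which block is worth optimizing first. Below is a minimal sketch of such a measurement, not GAST-NET's actual code: `time_blocks`, the stage names, and the stand-in lambdas are all hypothetical placeholders. It assumes each block consumes the previous block's output. CUDA kernels are launched asynchronously, so on a GPU you would pass `torch.cuda.synchronize` as `sync`; otherwise the clock is read before the work has finished.

```python
import time
from collections import defaultdict


def time_blocks(blocks, frames, sync=lambda: None):
    """Return the average wall-clock seconds spent in each named block.

    blocks: list of (name, callable); each callable receives the output
            of the previous block (the first one receives the frame).
    sync:   called right before every clock read. The default does nothing;
            on a GPU, pass torch.cuda.synchronize so pending kernels finish
            before the time is taken.
    """
    totals = defaultdict(float)
    n_frames = 0
    for frame in frames:
        x = frame
        for name, fn in blocks:
            sync()
            start = time.perf_counter()
            x = fn(x)
            sync()
            totals[name] += time.perf_counter() - start
        n_frames += 1
    if n_frames == 0:
        return {}
    return {name: total / n_frames for name, total in totals.items()}


if __name__ == "__main__":
    # Hypothetical stand-ins for the 4 blocks; replace with the real ones.
    blocks = [
        ("block_1", lambda x: x),
        ("block_2", lambda x: x),
        ("block_3", lambda x: x),
        ("block_4", lambda x: x),
    ]
    for name, seconds in time_blocks(blocks, frames=range(10)).items():
        print(f"{name}: {seconds * 1e3:.3f} ms per frame")
```

If one block dominates the per-frame time, that is the place to look first. If the per-block times add up to much less than the total time per frame, the remainder is probably spent outside the blocks (for example in data loading or CPU-side pre- and post-processing), which could also be one explanation for the 38 to 50% utilization you observed.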