Speeding up inference on CPU with a complex and dense model

The model I am using is extremely complex: it does not just pass data between different PyTorch modules, but also performs various math operations and reshapes on the data in between. When I applied PyTorch's pruning API (introduced in 1.4), I was able to maintain acceptable accuracy at around 60% sparsity. Unfortunately, pruning isn't much further along than the research phase, so the zeroed weights are still stored dense and don't actually make the CPU kernels faster. I also tried quantization, but operator support wasn't good enough and would have required me to quantize and dequantize very frequently.
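For reference, here is roughly the pruning I applied. The layer below is a placeholder stand-in, not my actual model:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for one dense layer of the real model (illustrative only).
layer = nn.Linear(512, 512)

# L1 unstructured pruning at 60% sparsity, using the pruning API
# added in PyTorch 1.4.
prune.l1_unstructured(layer, name="weight", amount=0.6)
prune.remove(layer, "weight")  # fold the pruning mask into the weight

# 60% of the weights are now zero, but the tensor is still stored
# dense, so the CPU matmul does the same amount of work as before.
print((layer.weight == 0).float().mean().item())  # ~0.6
```

And this is the kind of quant/dequant sandwich that every stretch of unsupported math forces on me (again a simplified sketch with placeholder names and sizes):

```python
import torch
from torch.quantization import (DeQuantStub, QuantStub, convert,
                                get_default_qconfig, prepare)

class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.fc = torch.nn.Linear(128, 128)
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)    # float -> int8 after convert()
        x = self.fc(x)       # runs as a quantized linear
        x = self.dequant(x)  # back to float for the custom math
        # Ops like this have no int8 kernels, so I have to dequantize
        # before them and re-quantize afterwards, over and over.
        return x.reshape(-1, 128) * 0.5

model = Block().eval()
model.qconfig = get_default_qconfig("fbgemm")
prepare(model, inplace=True)
model(torch.randn(4, 128))  # calibration pass
convert(model, inplace=True)
```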

Are there any other techniques for speeding up my model's inference time on CPU that I could try? Or is it better to just re-architect the entire thing so the data flows more cleanly, with quantization in mind from the start?