CPU utilization freezes after a while (GPU works great)


I have been trying to run CPU-accelerated inference for our model. The only thing I change from the GPU version is adding torch.set_num_threads(threads). I am running the model on an AWS m5.24xlarge with 96 vCPUs, with num_threads set to 86 and num_workers set to 8. It starts out well, giving an ETA of ~16 hours, but after about 40% completion most of the PyTorch processes start jumping between the "S" (sleeping) and "R" (running) states. It got stuck at 47%, and I had to kill it.
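For reference, the CPU setup is essentially this (a minimal sketch of the relevant part only; the dataset/model loop is a placeholder, not the actual script):

```python
import torch

threads = 86      # out of the 96 vCPUs on the m5.24xlarge
num_workers = 8   # DataLoader worker processes

# The only change from the GPU version:
torch.set_num_threads(threads)

# Placeholder inference loop -- the real dataset and model come from the linked repo.
# loader = torch.utils.data.DataLoader(dataset, num_workers=num_workers)
# with torch.no_grad():
#     for batch in loader:
#         predictions = model(batch)
```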

Should I do something else for a CPU-accelerated version?

This is on Ubuntu 18.04 (the Deep Learning AMI available from AWS)
Python: 3.6.7
PyTorch: 1.1

The model is here: https://github.com/kishwarshafin/helen/blob/master/modules/python/models/TransducerModel.py

And the way I am running it is here:

Any help would be highly appreciated.