So I have a unique situation where I've got access to a large cluster but no GPUs. I'm currently running GPT-2 Large without issue using CPU inference; the response time is under two seconds, which is acceptable for a conversational agent. GPT-2 XL, however, clocks in at around 4 minutes 30 seconds, so I'm curious what options there are in PyTorch 1.8 for optimizing CPU inference.
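For context, one option I've been looking at is dynamic quantization of the Linear layers, roughly along these lines (this is just a sketch assuming the Hugging Face gpt2-xl checkpoint; I'm not sure it actually catches all of GPT-2's internal layers, and I haven't measured the accuracy impact):

```python
import time
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Assumption: Hugging Face GPT-2 XL weights; swap in whatever checkpoint applies.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
model.eval()

# Dynamic int8 quantization of nn.Linear modules; as I understand it,
# this only affects CPU execution.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
start = time.time()
with torch.no_grad():
    output = quantized.generate(**inputs, max_length=50)
print(tokenizer.decode(output[0]))
print(f"latency: {time.time() - start:.1f}s")
```

Is this the right direction, or are there better 1.8-era options (TorchScript, oneDNN, etc.) for a model this size?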
Each CPU has six (6) cores. Does the stock PyTorch 1.8 distribution include multithreading support for CPU inference?
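If it does, here's roughly how I'd expect to pin the thread counts (the split between intra-op and inter-op threads below is just a guess on my part):

```python
import torch

# Six physical cores per CPU: give intra-op parallelism one thread per core
# and leave a couple of threads for inter-op work. Must be set before any
# parallel work runs.
torch.set_num_threads(6)
torch.set_num_interop_threads(2)

print("intra-op threads:", torch.get_num_threads())
print("inter-op threads:", torch.get_num_interop_threads())
```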
And with this distributed training announcement, are there any distributed inference options available now?
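The naive thing I had in mind is just sharding incoming prompts across ranks with the gloo (CPU) backend, something like the sketch below, launched with `python -m torch.distributed.launch` (the rendezvous setup and prompt splitting here are assumptions, not something I've tested):

```python
import torch.distributed as dist

# Each rank loads its own copy of the model and serves a slice of the
# incoming prompts; gloo is the CPU-only backend.
dist.init_process_group(backend="gloo")  # assumes env:// rendezvous (MASTER_ADDR, etc.)
rank = dist.get_rank()
world_size = dist.get_world_size()

prompts = ["prompt 0", "prompt 1", "prompt 2", "prompt 3"]  # placeholder inputs
my_prompts = prompts[rank::world_size]

# ... run GPT-2 CPU inference on my_prompts here, then gather the results ...

dist.barrier()
dist.destroy_process_group()
```

Is there anything smarter than this in 1.8, e.g. splitting a single model across nodes rather than replicating it?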
Thanks in advance