So I have a unique situation where I've got access to a large cluster but no GPUs. I'm currently running GPT-2 Large without issue using CPU inference; the response time is under two seconds, which is acceptable for a conversational agent. GPT-2 XL, however, clocks in at around 4 minutes 30 seconds, so I'm curious what options there are in PyTorch 1.8 for optimizing CPU inference.
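For context, one option I've been looking at is dynamic quantization of the Linear layers, roughly along these lines (this is just a sketch assuming the Hugging Face gpt2-xl checkpoint; I'm not sure it actually catches all of GPT-2's internal layers, and I haven't measured the accuracy impact):

```python
import time
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Assumption: Hugging Face GPT-2 XL weights; swap in whatever checkpoint applies.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
model.eval()

# Dynamic int8 quantization of nn.Linear modules; as I understand it,
# this only affects CPU execution.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
start = time.time()
with torch.no_grad():
    output = quantized.generate(**inputs, max_length=50)
print(tokenizer.decode(output[0]))
print(f"latency: {time.time() - start:.1f}s")
```

Is this the right direction, or are there better 1.8-era options (TorchScript, oneDNN, etc.) for a model this size?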
Each CPU has six (6) cores. Does the stock PyTorch 1.8 distribution include multithreading support for CPU inference?
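If it does, here's roughly how I'd expect to pin the thread counts (the split between intra-op and inter-op threads below is just a guess on my part):

```python
import torch

# Six physical cores per CPU: give intra-op parallelism one thread per core
# and leave a couple of threads for inter-op work. Must be set before any
# parallel work runs.
torch.set_num_threads(6)
torch.set_num_interop_threads(2)

print("intra-op threads:", torch.get_num_threads())
print("inter-op threads:", torch.get_num_interop_threads())
```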
And with this distributed training announcement, are there any distributed inference options available now?
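The naive thing I had in mind is just sharding incoming prompts across ranks with the gloo (CPU) backend, something like the sketch below, launched with `python -m torch.distributed.launch` (the rendezvous setup and prompt splitting here are assumptions, not something I've tested):

```python
import torch.distributed as dist

# Each rank loads its own copy of the model and serves a slice of the
# incoming prompts; gloo is the CPU-only backend.
dist.init_process_group(backend="gloo")  # assumes env:// rendezvous (MASTER_ADDR, etc.)
rank = dist.get_rank()
world_size = dist.get_world_size()

prompts = ["prompt 0", "prompt 1", "prompt 2", "prompt 3"]  # placeholder inputs
my_prompts = prompts[rank::world_size]

# ... run GPT-2 CPU inference on my_prompts here, then gather the results ...

dist.barrier()
dist.destroy_process_group()
```

Is there anything smarter than this in 1.8, e.g. splitting a single model across nodes rather than replicating it?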
Thanks in advance