Yeah so there are a lot of things you can do to make CPU inference faster, sorted by increasing complexity:
1. Use an m6i or c6i instance; they are much faster.
2. Set `torch.set_num_threads(1)` in your script (see the sketch after this list).
3. Use Intel IPEX.
4. Use the Intel CPU launcher script.
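Here's a minimal sketch of tricks 2 and 3 together. It assumes a Hugging Face `bert-base-uncased` model (any CPU-friendly torch model works the same way) and that IPEX is installed via `pip install intel-extension-for-pytorch`:

```python
import torch
from transformers import AutoModel, AutoTokenizer  # assumes HF transformers is installed

# One thread per process avoids oversubscription when you run
# several inference workers on the same box.
torch.set_num_threads(1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

# Optional: IPEX rewrites the model with CPU-specific optimizations
# (op fusion, channels-last layouts). Skipped cleanly if not installed.
try:
    import intel_extension_for_pytorch as ipex
    model = ipex.optimize(model)
except ImportError:
    pass

inputs = tokenizer("Hello world", return_tensors="pt")
with torch.inference_mode():
    outputs = model(**inputs)
```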
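For trick 4, the launcher script handles core pinning and OpenMP/allocator settings for you. If I remember right it's invoked as `python -m intel_extension_for_pytorch.cpu.launch your_script.py` on the IPEX side, and newer PyTorch versions ship a similar upstreamed launcher at `python -m torch.backends.xeon.run_cpu your_script.py`; the exact module path depends on your version, so double-check the docs.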
Using these kinds of tricks, I managed to get BERT inference down from 2s to roughly 20ms.
Tricks 3-4 are integrated into TorchServe if you're interested: https://github.com/pytorch/serve