Recommended way of PyTorch model behind a high traffic REST API?

As the title suggests, I’m wondering what the right way is to expose a PyTorch model that needs to handle thousands of requests a second?

Would one write a REST API in C++ and serve the model using TorchScript?
Or can one be lazy and serve it through a Python-based Flask app (or something more resilient) and use PyTorch as is?
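If you do go the Python route, one technique that matters more for throughput than the choice of web framework is micro-batching: collecting several in-flight requests and running them through the model as one batch. Here is a minimal stdlib-only sketch of that pattern; `predict_batch` is a hypothetical stand-in for the real model call (e.g. `model(torch.stack(inputs))`):

```python
import queue
import threading

def predict_batch(inputs):
    # Stand-in for the real batched model call; just doubles each input here.
    return [x * 2 for x in inputs]

class BatchingWorker:
    """Collects concurrent requests into batches for a single inference loop."""

    def __init__(self, max_batch=32):
        self.requests = queue.Queue()
        self.max_batch = max_batch
        threading.Thread(target=self._loop, daemon=True).start()

    def infer(self, x):
        # Called from each request-handler thread; blocks until the result is ready.
        done = threading.Event()
        slot = {"input": x, "event": done}
        self.requests.put(slot)
        done.wait()
        return slot["output"]

    def _loop(self):
        while True:
            batch = [self.requests.get()]  # block for the first request
            # Greedily drain whatever else is already queued, up to max_batch.
            while len(batch) < self.max_batch:
                try:
                    batch.append(self.requests.get_nowait())
                except queue.Empty:
                    break
            outputs = predict_batch([s["input"] for s in batch])
            for slot, out in zip(batch, outputs):
                slot["output"] = out
                slot["event"].set()
```

Under load, handler threads pile requests into the queue and the worker turns them into GPU-sized batches, which amortizes per-call overhead; at low traffic each request simply runs as a batch of one.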

Or can one take the extra step and serve using a Go-based service, somehow using a model saved as TorchScript? (If so, how would one leverage GPU-based inference?)
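For any non-Python runtime (C++ via libtorch, or Go through cgo bindings to it), the common bridge is exporting the model to TorchScript from Python first. A minimal sketch, assuming a trivial placeholder module:

```python
import torch

class TinyModel(torch.nn.Module):
    # Placeholder model; a real service would export its trained network.
    def forward(self, x):
        return x * 2.0

model = TinyModel().eval()
scripted = torch.jit.script(model)  # or torch.jit.trace(model, example_input)
scripted.save("model.pt")           # loadable from C++ via torch::jit::load
```

On the C++ side, GPU inference works by moving the loaded module and its input tensors to CUDA (`module.to(at::kCUDA)` and the equivalent for inputs) before calling `forward`.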

What are the pros and cons of the above methods?

Technically, it depends on the size of the model you are using… But there’s a reason people use C++ in production: Python just isn’t fast enough for that kind of traffic.

I’ve done a classification demo website in the past, Python/PyTorch-only with Flask or Bottle or something like that, and requests took a couple of seconds to process (it was a ResNet-34 model with an input size close to ImageNet’s, run on a 1080 Ti).

I’d recommend iterating: start with Python-only code, then switch to C++ if it’s too slow. I don’t know about Go, though!
