Streaming/chunking responses using TorchServe on Vertex AI

I’m production-izing a transformer model deployed on a Vertex AI endpoint, but I’m running into issues that may be a limitation of the Vertex AI product.

The input is a single image and the output is a single image; however, the Vertex AI endpoint fails because the response size is too large, e.g.

```
ERROR: ( HTTP request failed. Response:
  "error": {
    "code": 400,
    "message": "Response size too large. Received at least 31517591 bytes; max is 31457280.",
```

This looks like the same request/response size limit that Cloud Run has, so I figured I need to chunk/stream responses back from this endpoint.

I’m using TorchServe’s send_intermediate_predict_response function to send chunked responses back, but it seems this chunking only happens inside TorchServe itself, i.e., from its backend to its frontend; the Vertex AI endpoint then attempts to send the entire assembled response back at once.
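For reference, here is a rough sketch of the chunking approach inside a custom handler. `send_intermediate_predict_response` is TorchServe's real streaming API, but the `split_into_chunks` helper, `CHUNK_SIZE` value, and handler skeleton below are illustrative assumptions, not TorchServe's canonical handler code:

```python
# Sketch of chunking a large image payload in a TorchServe custom handler.
# split_into_chunks and CHUNK_SIZE are illustrative names/values.

CHUNK_SIZE = 1024 * 1024  # 1 MiB per chunk (arbitrary choice)

def split_into_chunks(payload: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield successive chunk_size slices of payload."""
    for start in range(0, len(payload), chunk_size):
        yield payload[start:start + chunk_size]

# Inside a handler's inference path it would look roughly like:
#
#     from ts.protocol.otf_message_handler import send_intermediate_predict_response
#
#     def handle(self, data, context):
#         image_bytes = self.run_inference(data)  # hypothetical helper
#         for chunk in split_into_chunks(image_bytes):
#             send_intermediate_predict_response(
#                 [chunk], context.request_ids, "Intermediate response", 200, context
#             )
#         return [b""]  # final response terminates the stream

demo = bytes(2_500_000)  # ~2.5 MB dummy payload
chunks = list(split_into_chunks(demo))
print(len(chunks))                                # 3
print(sum(len(c) for c in chunks) == len(demo))   # True
```

As you observed, though, this only streams between TorchServe's backend and frontend; Vertex AI still buffers the full response before returning it.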

Is there a way to get TorchServe + Vertex AI to return a chunked response?

This doesn’t seem to be possible. My solution was to write results to GCS and serve them from GCS.
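A minimal sketch of that workaround, assuming the `google-cloud-storage` client and hypothetical bucket/blob names — the endpoint returns a small `gs://` URI instead of the raw image bytes:

```python
# Sketch of the GCS workaround: upload the result image and return a
# tiny gs:// URI instead of megabytes of pixels. Bucket/blob names are
# hypothetical; upload_result needs google-cloud-storage and credentials.

def gcs_uri(bucket_name: str, blob_name: str) -> str:
    """Build the gs:// URI the endpoint would return instead of raw bytes."""
    return f"gs://{bucket_name}/{blob_name}"

def upload_result(image_bytes: bytes, bucket_name: str, blob_name: str) -> str:
    # Imported lazily so the pure helper above works without GCP deps.
    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(image_bytes, content_type="image/png")
    return gcs_uri(bucket_name, blob_name)

# The handler's postprocess step then returns just the URI string,
# keeping the Vertex AI response far below the 32 MB limit.
print(gcs_uri("my-results-bucket", "outputs/abc123.png"))
# → gs://my-results-bucket/outputs/abc123.png
```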


Hi Eric

Google Cloud has a limit of 32 MB, as you’ve found. The best way to ‘increase’ this limit is to use GCS and signed URLs. Albeit written in JavaScript, this blog post is pretty helpful at explaining it.
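To make that concrete, here is a hedged sketch of generating a V4 signed URL with the `google-cloud-storage` client (`generate_signed_url` is the real blob method); the bucket/blob names and the `clamp_expiration_minutes` helper are illustrative:

```python
# Sketch: hand the client a time-limited signed URL so it downloads the
# result directly from GCS. Needs google-cloud-storage and credentials.
import datetime

def clamp_expiration_minutes(minutes: int) -> int:
    """V4 signed URLs allow at most 7 days of validity; clamp to that."""
    max_minutes = 7 * 24 * 60  # 10080
    return min(max(minutes, 1), max_minutes)

def signed_url_for(bucket_name: str, blob_name: str, minutes: int = 15) -> str:
    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    return blob.generate_signed_url(
        version="v4",
        expiration=datetime.timedelta(minutes=clamp_expiration_minutes(minutes)),
        method="GET",
    )

print(clamp_expiration_minutes(15))      # 15
print(clamp_expiration_minutes(100000))  # 10080 (clamped to 7 days)
```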

Ensure you’ve also increased the relevant limits in TorchServe’s config so it can handle the larger payloads too, docs here.
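For example, TorchServe’s `config.properties` exposes `max_request_size` and `max_response_size` (in bytes); the 64 MiB value below is just an illustration:

```
# config.properties — raise TorchServe's payload limits (values in bytes).
# The default is 6553500 (~6.5 MB); 67108864 (64 MiB) here is illustrative.
max_request_size=67108864
max_response_size=67108864
```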


Perhaps consulting a professional experienced with Vertex AI would help resolve this issue.