I’m production-izing a transformer model deployed on a Vertex AI endpoint, but I’m running into issues that may be a limitation of the Vertex AI product.
The input is a single image and the output is a single image, however, the vertex ai endpoint fails due to the response size being too large, e.g.
ERROR: (gcloud.ai.endpoints.raw-predict) HTTP request failed. Response:
{
"error": {
"code": 400,
"message": "Response size too large. Received at least 31517591 bytes; max is 31457280.",
"status": "FAILED_PRECONDITION"
}
}
This seems like a similar limit on Cloud Run reqs/responses, so I figured I need to chunk/stream responses back from this endpoint.
I’m using the send_intermediate_predict_response
function to send chunked responses back, but it seems like this chunking only happens within torch serve, i.e. from its backend to its frontend and then the vertex ai endpoint attempts to send the entire response back.
Is there a way to get torch serve + vertex ai to return a chunked response?