Logging predictions/monitoring at inference time

I am currently in the process of setting up model monitoring for models served with torchserve on Kubernetes. Ideally, I would like to store input and output images for later manual prediction inspection.

I was wondering what would be the best way to achieve such a setup in a custom handler:

  • Dump the preprocessed image and the model output every now and then in the handler's inference method (so as not to impact inference performance too heavily with such an expensive operation). In this scenario the consumer should provide a prediction_id so that the request data can later be reconciled with any user feedback or user-generated event. A sketch of the dump_to_gcs helper follows this list.
    # assumes `import random` and a `dump_to_gcs` helper at module level
    def inference(self, data):
        # data comes from preprocess(): the transformed image plus the
        # prediction_id supplied by the consumer
        img, pred_id = data
        output = self.model(img)

        # Only log ~20% of requests (example) to limit the logging overhead
        if random.random() < 0.2:
            dump_to_gcs(pred_id, img)
            dump_to_gcs(pred_id, output)

        return output
  • Let the consumer log inputs/outputs as part of the request handling. A couple of pros/cons are:
    • pro: since this is done outside the inference container, there is no impact on performance.
    • con: raw input would be stored - this means that for models with heavy preprocessing before inference, the same transformations would have to be applied to the stored input data before it is usable (especially relevant for models based on tabular data).
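
For reference, here is a minimal sketch of what the dump_to_gcs helper used above could look like, assuming the google-cloud-storage client; the bucket name, object layout and the suffix argument are made up. Pushing the upload onto a background thread keeps the blocking I/O out of the inference path.

    import io
    import threading

    import numpy as np
    from google.cloud import storage  # assumes google-cloud-storage is installed

    # Credentials are picked up from the environment; bucket name is hypothetical
    _client = storage.Client()
    _bucket = _client.bucket("my-monitoring-bucket")


    def dump_to_gcs(pred_id, array, suffix="input"):
        """Serialize a numpy-convertible array and upload it without blocking inference."""
        buffer = io.BytesIO()
        np.save(buffer, np.asarray(array))  # assumes CPU, numpy-convertible data
        payload = buffer.getvalue()

        def _upload():
            blob = _bucket.blob(f"predictions/{pred_id}/{suffix}.npy")
            blob.upload_from_string(payload, content_type="application/octet-stream")

        # Fire-and-forget upload so the handler returns as soon as the model does
        threading.Thread(target=_upload, daemon=True).start()

With a signature like this, the second call in the snippet above would pass something like suffix="output" so the input and output blobs do not overwrite each other.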

As expected, the usual array of checks, validations and alerts will be built on top of the collected data to inform retraining needs and to detect different kinds of drift.

I was wondering what a commonly adopted approach would look like in this case and what kinds of gotchas one should be mindful of.

For our custom metrics we’ve opted to instrument them directly in the handler: https://github.com/pytorch/serve/blob/master/docs/metrics.md
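
In case it helps, a minimal sketch of what that can look like, based on the custom metrics API described in the linked doc; the handler class and metric names are made up, and the exact method signatures may vary between TorchServe versions.

    import time

    from ts.torch_handler.base_handler import BaseHandler


    class MonitoredHandler(BaseHandler):
        def handle(self, data, context):
            metrics = context.metrics
            start = time.time()

            output = self.postprocess(self.inference(self.preprocess(data)))

            # Emitted alongside TorchServe's built-in metrics
            metrics.add_counter("InferenceRequestCount", len(data))
            metrics.add_time("HandlerRoundTripTime", round((time.time() - start) * 1000))

            return output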

FWIW, I think it’s fine to take a performance hit for logging; it’s generally not as bad as people think, and it will save you a lot more time in the long run when you’re debugging issues in prod.

Thanks a lot for your reply, I think that metrics, custom or not, make a lot of sense. On the other hand, some models I am working with output an image, and for those I’ll probably still go for dumping the output image to a bucket in the handler’s inference method.

I guess this opens up a couple of issues on my side:

  1. It is probably better to ingest raw inputs/outputs in order to make it easier for future models to integrate that data into their training data (even with different preprocessing pipelines).
  2. Storing raw inputs/outputs would force me to access the preprocessing/postprocessing functions at monitoring time (AWS follows a similar approach in SageMaker) in order to make the inference data comparable with what was fed to the model during training; a sketch of sharing the preprocessing code between the handler and the monitoring job follows this list.
  3. By logging only a subsample of observations in production, I might run into situations where the user has provided a ground-truth label but the features fed to the model are missing; I guess I’d just have to live with that if I don’t want to log each and every prediction.
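
On point 2, a minimal sketch of the direction I have in mind (module and function names are made up): keep the preprocessing logic in one small shared module that both the TorchServe handler and the offline monitoring job import, so the raw inputs dumped at inference time can be re-transformed exactly as they were before reaching the model.

    # preprocessing.py -- hypothetical shared module, versioned alongside the model
    import numpy as np
    from PIL import Image


    def preprocess(image):
        """Single source of truth for the transform applied before inference."""
        resized = image.resize((224, 224))
        array = np.asarray(resized, dtype=np.float32) / 255.0
        # Channels-first layout expected by the model (assumes an RGB input)
        return np.transpose(array, (2, 0, 1))


    # Handler side (serving time):
    #     from preprocessing import preprocess
    #     tensor = torch.from_numpy(preprocess(img)).unsqueeze(0)
    #
    # Monitoring side (offline, over the raw inputs dumped to the bucket):
    #     from preprocessing import preprocess
    #     features = preprocess(Image.open(raw_blob_path))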