Sorry if this sounds a bit noob — I’m still new to deploying deep learning models on edge devices.
I’ve been reading a lot of academic papers, and what I keep seeing is that they only report latency or FPS when they talk about real-time performance on the device. I don't see any predictive metrics like accuracy, precision, or recall reported on-device during deployment.
My question is:
Why don't we just take a small chunk of the test set (held out before training), run it directly on the edge device, and evaluate the predictive performance while the model is running on that hardware (see the sketch after this list)? That seems like it would give us the most realistic measure of the model's actual performance in deployment. Is this approach:
- Not standard practice?
- Technically difficult or even impossible?
- Considered meaningless or unnecessary?
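Concretely, here's a minimal sketch of what I have in mind, assuming a TFLite classifier and a small preprocessed test subset copied onto the device (file names and shapes are just placeholders):

```python
# Hypothetical on-device evaluation: run a held-out test subset through the
# deployed TFLite model and compute predictive metrics on the edge device itself.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# tflite_runtime is the lightweight interpreter typically installed on edge devices;
# on a desktop you could use tf.lite.Interpreter instead.
from tflite_runtime.interpreter import Interpreter

# Placeholder file names: a converted model and a small test subset copied to the device.
interpreter = Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

x_test = np.load("x_test_subset.npy").astype(np.float32)  # (N, H, W, C), already preprocessed
y_test = np.load("y_test_subset.npy")                      # (N,) integer labels

preds = []
for sample in x_test:
    interpreter.set_tensor(input_details["index"], sample[np.newaxis, ...])
    interpreter.invoke()
    logits = interpreter.get_tensor(output_details["index"])
    preds.append(int(np.argmax(logits, axis=-1)[0]))

preds = np.array(preds)
print("accuracy :", accuracy_score(y_test, preds))
print("precision:", precision_score(y_test, preds, average="macro"))
print("recall   :", recall_score(y_test, preds, average="macro"))
```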
And more generally — what is the standard process here?
Is it:
- Train and test the model locally (with full evaluation metrics),
- Deploy the model on the device,
- Then only measure latency/FPS on-device — and nothing about predictive accuracy?
My expectation would be to see the same predictive accuracy on the training machine and on the deployment device. That is, even if different algorithms, kernels, etc. cause the expected numerical mismatches due to limited floating-point precision, I would not expect any difference in the actual predictions unless the implementation is broken.
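In other words, if I saved the raw outputs for the same batch on the training machine and on the device, I'd expect a check roughly like this (hypothetical logit arrays) to pass:

```python
import numpy as np

# Hypothetical raw outputs for the same test batch, saved on each platform.
logits_host = np.load("logits_host.npy")      # from the training machine
logits_device = np.load("logits_device.npy")  # from the edge device

# Small numerical differences are expected (different kernels, fused ops, fp precision)...
print("max abs diff:", np.abs(logits_host - logits_device).max())
print("all close   :", np.allclose(logits_host, logits_device, rtol=1e-3, atol=1e-5))

# ...but the actual predicted labels should not change unless something is broken.
agreement = (logits_host.argmax(axis=-1) == logits_device.argmax(axis=-1)).mean()
print("top-1 agreement:", agreement)  # I would expect this to be 1.0
```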
This is a good question, and not noob at all. You're right to wonder why on-device predictive performance isn't more commonly reported, especially since deployment environments can introduce subtle changes (e.g., quantization error, hardware-specific ops). In practice, many researchers assume that accuracy doesn't change post-training as long as the model architecture and weights remain untouched. But when quantization or other optimizations are involved, it absolutely makes sense to evaluate a held-out test set on the actual device.
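For instance, post-training quantization changes the weights and activations the deployed model actually computes with, so the held-out metrics are worth re-measuring on the quantized artifact rather than assumed from the float32 model. A rough sketch with TensorFlow Lite (model and file names are placeholders):

```python
import numpy as np
import tensorflow as tf

# Assumed: a trained Keras model and a small calibration subset already saved to disk.
model = tf.keras.models.load_model("model.h5")
calib = np.load("calibration_subset.npy").astype(np.float32)

def representative_dataset():
    # Yields single-sample batches used to calibrate activation ranges.
    for sample in calib[:200]:
        yield [sample[np.newaxis, ...]]

# Post-training integer quantization.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_quantized = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_quantized)

# Re-running the same held-out evaluation loop on model_quantized.tflite (ideally
# on the target device) is what catches any accuracy drop introduced by quantization.
```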
That said, the standard workflow does tend to separate model evaluation (done offline) and deployment benchmarking (focused on latency/FPS/power). This is mostly due to practicality and reproducibility — not because it’s meaningless to evaluate on-device accuracy. In real-world applications, teams often do exactly what you described: send a test batch through the deployed model and log the results for validation.
So no, it’s not meaningless — just underreported in academic papers. Great that you’re thinking about this!
Absolutely agree — prediction consistency between training and deployment is a fundamental expectation. Minor floating-point differences are understandable due to hardware or kernel variations, but if the outputs are diverging beyond tolerance, it’s usually a sign of something deeper like preprocessing mismatches, model export issues, or incorrect inference configuration.
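One quick way to narrow it down is to take preprocessing out of the equation: feed the exact same preprocessed batch to the original model and to the exported one, and compare the raw outputs. A rough sketch assuming a Keras original and its TFLite export (file names are placeholders):

```python
import numpy as np
import tensorflow as tf

# The same preprocessed batch is fed to both models, so any divergence left over
# must come from the export/conversion or the inference configuration,
# not from preprocessing differences on the device.
batch = np.load("preprocessed_batch.npy").astype(np.float32)

keras_model = tf.keras.models.load_model("model.h5")
ref_out = keras_model.predict(batch, verbose=0)

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

tflite_out = []
for sample in batch:
    interpreter.set_tensor(inp["index"], sample[np.newaxis, ...])
    interpreter.invoke()
    tflite_out.append(interpreter.get_tensor(out["index"])[0])
tflite_out = np.stack(tflite_out)

print("max abs diff   :", np.abs(ref_out - tflite_out).max())
print("top-1 agreement:", (ref_out.argmax(-1) == tflite_out.argmax(-1)).mean())
```

If this comparison is clean but the on-device results still diverge, the problem is most likely in the device-side preprocessing or input/output handling rather than in the model conversion itself.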