1. Kreuzberger, D., et al. (2023). Machine Learning Operations (MLOps): Overview, Definition, and Architecture. IEEE Access, 11, 31756-31775. In Section IV-A, "Model Evaluation," the authors emphasize that "evaluation of ML models is a multi-faceted problem" and that a comprehensive assessment requires a variety of metrics and techniques, particularly for continuous monitoring after deployment. (DOI: https://doi.org/10.1109/ACCESS.2023.3262138)
2. Stanford University. (n.d.). CS229 Machine Learning Course Notes: Evaluation Metrics. The course materials discuss why classification problems call for metrics beyond simple accuracy, in particular the precision-recall trade-off. For applications such as fraud detection, evaluating this trade-off is critical to understanding business impact, and doing so requires multiple evaluation points rather than a single metric. (See the discussion of precision, recall, and F1-score in the course's public materials.)
3. Saleh, M. (2022). MLOps: The Ultimate Guide. Towards Data Science. Although not a formal academic paper, this guide, widely referenced in data science curricula, explains that robust model evaluation in production (a core MLOps principle) combines several techniques, including A/B testing, canary deployments, and monitoring a dashboard of metrics rather than a single one. This aligns with the need for a diverse set of validation techniques.
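
The precision-recall trade-off mentioned under reference 2 can be illustrated with a minimal sketch. The labels and scores below are made-up illustrative data for a hypothetical fraud classifier, and the function name `precision_recall_f1` is an assumption for this example; none of it comes from the cited sources.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Compute precision, recall, and F1 for binary labels at one score threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1  # flagged and actually positive
        elif predicted == 1 and label == 0:
            fp += 1  # flagged but actually negative
        elif predicted == 0 and label == 1:
            fn += 1  # positive case that was missed
    # Guard against division by zero when nothing is flagged or no positives exist.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Illustrative data only: 1 = fraud, 0 = legitimate; scores are model outputs in [0, 1].
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90]

# Evaluate at several thresholds instead of reporting a single number.
for threshold in (0.3, 0.5, 0.8):
    p, r, f1 = precision_recall_f1(y_true, scores, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
```

On this toy data, raising the threshold increases precision while recall falls, which is why one operating point does not describe the model's behavior.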