1. NVIDIA NeMo. (2024). NeMo Evaluator Documentation. The documentation details metrics for evaluating LLMs, including "correctness" (assessing factual accuracy against ground truth) and performance metrics like "latency" and "throughput." (Section: "Model Evaluation Metrics").
2. Gao, T., et al. (2023). Enabling Large Language Models to Generate Text with Citations. This paper on RAG evaluation emphasizes two key axes: "Faithfulness/Factual Consistency" (correctness) and "Relevance," while also considering performance metrics like response time (latency). (Section 4: "Evaluation").
3. Stanford University. (2023). CS25: Transformers United. Lecture on "LLM Evaluation." The course outlines key evaluation criteria for LLMs, highlighting "Accuracy" (correctness on specific tasks) and "Efficiency" (including latency and computational cost) as fundamental pillars.