1. NVIDIA Technical Blog. In "Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation," the authors state: "Quantization helps to reduce the model size... It also helps to reduce the amount of memory and cache used to store weights and activations... This leads to reduced latency and power consumption." This directly supports options A and D.
Source: NVIDIA Developer Blog, "Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation," May 11, 2021.
2. NVIDIA TensorRT Developer Guide. The guide explains the benefits of using lower precision for inference: "Memory usage is reduced, allowing for the deployment of larger networks... Data movement is reduced, leading to lower power consumption and higher throughput." This confirms that quantization saves memory and power.
Source: NVIDIA TensorRT 8.6 Developer Guide, Section 2.3, "Working With INT8."
3. Peer-Reviewed Academic Publication. A comprehensive survey lists the primary benefits of quantization: "(1) a reduction in memory footprint and cache usage, (2) a reduction in memory bandwidth, (3) a reduction in computational cost, and (4) a reduction in power consumption." This publication validates both A and D as key advantages; a brief numerical sketch of the memory-footprint reduction appears after this list.
Source: Gholami, A., et al. (2021). "A Survey of Quantization Methods for Efficient Neural Network Inference." arXiv:2103.13630, Section 2: "Benefits of Quantization," page 3.
4. University Courseware. Stanford's course on Convolutional Neural Networks explains that model compression techniques like quantization reduce the number of bits per weight, which "saves storage/memory" and makes models "more energy efficient."
Source: Stanford University, CS231n: Convolutional Neural Networks for Visual Recognition, Spring 2023, Lecture 14 notes on "Model Compression."
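
As a minimal illustration of the memory-footprint claim quoted in the sources above, the following Python sketch performs symmetric per-tensor INT8 quantization of a random FP32 weight matrix and compares storage sizes. The matrix shape, scale formula, and variable names are illustrative assumptions, not code taken from any of the cited sources.

```python
# Minimal sketch (illustrative only): symmetric per-tensor INT8 quantization
# of a weight matrix, showing the ~4x memory saving relative to FP32 that
# the quotations above describe.
import numpy as np

rng = np.random.default_rng(0)
weights_fp32 = rng.normal(0.0, 0.05, size=(1024, 1024)).astype(np.float32)

# Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

# Dequantize to measure the error introduced by the 8-bit representation.
weights_dequant = weights_int8.astype(np.float32) * scale
max_abs_error = np.abs(weights_fp32 - weights_dequant).max()

print(f"FP32 size: {weights_fp32.nbytes / 1024:.0f} KiB")  # 4096 KiB
print(f"INT8 size: {weights_int8.nbytes / 1024:.0f} KiB")  # 1024 KiB, 4x smaller
print(f"Max abs quantization error: {max_abs_error:.6f}")
```

Running the sketch shows the INT8 copy occupying one quarter of the FP32 footprint (1024 KiB vs. 4096 KiB) with a small maximum reconstruction error; the bandwidth and power savings cited above follow from moving and storing fewer bytes per weight.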