1. NVIDIA NeMo Framework Documentation. "Multimodal Models." The documentation details models like NeVA (NeMo Vision and Language Assistant), which are designed to understand both images and text and to generate relevant textual responses, the capability the question requires.
2. Stanford University. (Spring 2024). "CS231n: Deep Learning for Computer Vision," Lecture on Vision and Language. The course covers architectures that combine vision (CNNs/ViTs) and language models to perform tasks like visual question answering, which is analogous to the agent's required function.
3. Radford, A., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision." Proceedings of the 38th International Conference on Machine Learning. This paper on CLIP demonstrates the power of joint image-text understanding, which is the foundation for the multimodal pipeline required in the question.