Question 17 - NCP-AAI Exam Dumps 2026 – NVIDIA Agentic AI Professional Cert

Q: 17

A multi-modal field inspection agent must process text notes, photos, and form values to generate a report. Which design is MOST suitable?

Options

Correct Answer:

Explanation

The task requires processing three input types: text notes, photos (images), and form values (structured data), which makes it a multi-modal problem. Of the designs offered, only a multi-modal inference pipeline can ingest and combine information from all three sources. Producing a report also calls for organized, coherent output, so structured output generation is a key part of the solution. Together, these address every stated requirement.
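As a rough illustration of the reasoning above, the following minimal Python sketch shows one way the three inputs could be combined into a structured report. Every name in it (InspectionInput, InspectionReport, stub_vision_model, build_report, report_to_json) is hypothetical and chosen for this example only. The photo-description step is a placeholder where a real vision-language model call would go, and the summary simply reuses the inspector's notes rather than generating new text.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Callable


@dataclass
class InspectionInput:
    notes: str                    # free-text notes from the inspector
    photo_paths: list[str]        # paths to photos taken on site
    form_values: dict[str, str]   # structured form fields


@dataclass
class InspectionReport:
    site_id: str
    summary: str
    photo_findings: list[dict] = field(default_factory=list)
    form_values: dict[str, str] = field(default_factory=dict)


def stub_vision_model(photo_path: str) -> str:
    """Hypothetical stand-in for a vision-language model call (assumption)."""
    return f"placeholder description of {photo_path}"


def build_report(
    inp: InspectionInput,
    describe_photo: Callable[[str], str] = stub_vision_model,
) -> InspectionReport:
    """Merge text, image, and form inputs into one structured report."""
    # Image modality: one finding per photo, described by the injected model.
    findings = [
        {"photo": path, "description": describe_photo(path)}
        for path in inp.photo_paths
    ]
    return InspectionReport(
        # Structured modality: form fields are carried through as-is.
        site_id=inp.form_values.get("site_id", "unknown"),
        # Text modality: a real pipeline would synthesize a summary here;
        # this sketch only trims the notes.
        summary=inp.notes.strip(),
        photo_findings=findings,
        form_values=dict(inp.form_values),
    )


def report_to_json(report: InspectionReport) -> str:
    """Serialize the report to JSON, i.e. the structured output."""
    return json.dumps(asdict(report), indent=2)


if __name__ == "__main__":
    example = InspectionInput(
        notes="Hairline crack near the inlet valve.",
        photo_paths=["valve_front.jpg"],
        form_values={"site_id": "S-1", "pressure_check": "pass"},
    )
    print(report_to_json(build_report(example)))
```

Passing the photo describer in as a parameter keeps the sketch runnable without a model and lets a real multi-modal model be swapped in later without changing the report schema.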

Why Incorrect

"Pure text-only completion without visual understanding" cannot process photos, a critical input modality for a field inspection.

"Static OCR-free keyword matching only" is too simplistic: it cannot interpret images and has no generative capability for producing a coherent report.

"Audio generation pipeline only" is irrelevant to the task, which takes visual and textual inputs and produces a text-based report.

References

1. NVIDIA NeMo Framework Documentation. "Multimodal Models." The documentation describes models such as NeVA (NeMo Vision and Language Assistant), which take both images and text as input and generate textual responses, the capability this question calls for.

2. Stanford University. (Spring 2024). "CS231n: Deep Learning for Computer Vision," Lecture on Vision and Language. The course covers architectures that combine vision (CNNs/ViTs) and language models to perform tasks like visual question answering, which is analogous to the agent's required function.

3. Radford, A., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision." Proceedings of the 38th International Conference on Machine Learning. This paper on CLIP demonstrates the power of joint image-text understanding, which is the foundation for the multi-modal pipeline required in the question.
