Pretty sure it's A here. In production, tracking how many customer inquiries get handled per unit time tells you whether the app is keeping up with demand and whether users are actually being served. B is interesting but is more about operational costs. C and D are focused on model training or benchmarking, not live metrics. If I'm missing something, happy to hear other views.
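For anyone who wants to see what that metric looks like in practice, here's a minimal sketch that computes inquiries handled per minute from completion timestamps. The timestamps and field layout are made up for illustration, not taken from any particular logging setup.

```python
from datetime import datetime

# Assumed data: timestamps of inquiries the app finished handling,
# as they might be pulled from application logs.
completed_at = [
    datetime(2024, 1, 1, 9, 0, 5),
    datetime(2024, 1, 1, 9, 0, 40),
    datetime(2024, 1, 1, 9, 1, 10),
    datetime(2024, 1, 1, 9, 2, 55),
]

# Inquiries handled per minute over the observed window.
window_minutes = (completed_at[-1] - completed_at[0]).total_seconds() / 60
throughput = len(completed_at) / window_minutes if window_minutes > 0 else float("nan")
print(f"Inquiries handled per minute: {throughput:.2f}")
```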
A Generative AI Engineer is building a system that will answer questions on the latest stock news articles. Which of the following will NOT help ensure the outputs are relevant to financial news?
MLflow PyFunc is the go-to method here, so D. Saw this approach recommended in the official Databricks guide. If anyone saw another method used in recent exam practice, let me know.
I don’t think C is the way to go. Preprocessing the prompt up front shapes how the LLM interprets your inputs, which can make a bigger difference than cleaning up the output afterward. Postprocessing helps, but it doesn’t fix issues caused by a poorly structured prompt. Pretty sure D is the more standard choice, but open to other takes.
Yeah, D looks right to me. Using an MLflow PyFunc model gives you a clean way to bundle custom preprocessing steps with LLM calls, which is really handy for production pipelines in Databricks. Directly modifying the LLM architecture (A) is risky and not typical here. Sticking with D, but happy to be challenged if someone found a better approach.
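To make that concrete, here's a minimal sketch of the pattern: a custom mlflow.pyfunc.PythonModel that applies prompt preprocessing before a stubbed LLM call and gets logged with MLflow. The preprocessing rule, the _call_llm stub, and the input column name are placeholders I made up; in a real pipeline you'd swap in your actual cleaning logic and a call to your serving endpoint.

```python
import mlflow
import mlflow.pyfunc
import pandas as pd


class PreprocessedLLM(mlflow.pyfunc.PythonModel):
    """Bundles custom prompt preprocessing with an LLM call in one deployable unit."""

    def _preprocess(self, text: str) -> str:
        # Placeholder preprocessing: trim whitespace and prepend an instruction.
        return "Answer concisely: " + text.strip()

    def _call_llm(self, prompt: str) -> str:
        # Stub for the actual model call (e.g., a request to a serving endpoint).
        return f"[LLM response to: {prompt}]"

    def predict(self, context, model_input: pd.DataFrame):
        # Assumed input schema: a DataFrame with a 'prompt' column.
        return [self._call_llm(self._preprocess(p)) for p in model_input["prompt"]]


# Log the wrapped model so the preprocessing ships with it at inference time.
with mlflow.start_run():
    mlflow.pyfunc.log_model(artifact_path="llm_with_preprocessing",
                            python_model=PreprocessedLLM())
```

The payoff is that loading the logged model and calling predict runs the preprocessing and the LLM call together, instead of stitching them together in application code.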
D is correct for the Databricks workflow. If the question asked for the most secure method instead, would that change the pick?
Really wish Databricks would phrase these less vaguely. C makes sense, since using chat logs to find and summarize similar past answers addresses both speed and personalized responses. Pretty sure that's what they want here, but open to pushback.
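As a rough illustration of the "find similar answers in chat logs" idea, here's a small sketch that uses TF-IDF cosine similarity to pull the closest prior exchange for a new question. The log entries and the similarity method are assumptions for the example; a real system would more likely use an embedding model and a vector index.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up chat-log entries pairing past questions with agent answers.
chat_logs = [
    "How do I reset my password? -> Go to Settings > Security and click Reset.",
    "Where can I see my invoices? -> Open Billing and select Invoice History.",
    "How do I change my email address? -> Update it under Profile > Contact Info.",
]

query = "I forgot my password, how can I reset it?"

# Vectorize past exchanges and the new question together, then rank by similarity.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(chat_logs + [query])
scores = cosine_similarity(matrix[len(chat_logs)], matrix[: len(chat_logs)]).ravel()
best_match = chat_logs[scores.argmax()]
print("Most similar prior exchange:", best_match)
```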