View MLA-C01 Exam Questions

Q: 11

HOTSPOT

A company stores historical data in .csv files in Amazon S3. Only some of the rows and columns in the .csv files are populated. The columns are not labeled. An ML engineer needs to prepare and store the data so that the company can use the data to train ML models. Select and order the correct steps from the following list to perform this task. Each step should be selected one time or not at all. (Select and order three.)

• Create an Amazon SageMaker batch transform job for data cleaning and feature engineering.
• Store the resulting data back in Amazon S3.
• Use Amazon Athena to infer the schemas and available columns.
• Use AWS Glue crawlers to infer the schemas and available columns.
• Use AWS Glue DataBrew for data cleaning and feature engineering.

Discussion

Jason Mar 1, 2026 2:51 pm

Use AWS Glue crawlers, then Glue DataBrew, then store the results in S3.

Ishaan Feb 21, 2026 2:22 pm

Glue crawler, DataBrew, then store in S3. Official guide covers this flow well if you want the detailed why.

Ethan F. Feb 22, 2026 11:11 am

Not quite, it's not Athena first. You need to use Glue crawlers up front to actually infer schema since the files lack headers and are pretty sparse. Athena is more for querying, so that's a trap here. So I'd say: Glue crawler, then DataBrew for cleaning/features, finally save back to S3. Do folks agree?

Maya U. Feb 18, 2026 11:01 pm

Glue crawler, then DataBrew, and finally put the results back in S3. Saw something really similar on a practice test; the key is you have to infer schema first since those CSVs are missing headers. Pretty sure this is what AWS expects here.

Ravi Mar 3, 2026 9:40 am

Glue crawler, DataBrew, then save to S3. Athena's a common distractor here since it's usually for querying not schema inference.

HelpfulOps6673 Feb 19, 2026 11:28 am

Glue crawler, DataBrew, S3 in that order.

Jason Feb 16, 2026 12:59 pm

Glue crawler, then DataBrew for cleaning, and save the result in S3. Saw a similar flow in actual exam reports.

Luna M. Feb 23, 2026 4:11 am

Looks good to me, pretty consistent with practice and exam reports I’ve seen. Glue crawlers to infer schema first, DataBrew for transformation, save back to S3. I’d check the official AWS docs or do a hands-on lab if you want extra clarity.

Zoe X. Feb 20, 2026 5:14 pm

Athena, then SageMaker batch transform, then save to S3.

Correct Answer:

1. USE AWS GLUE CRAWLERS TO INFER THE SCHEMAS AND AVAILABLE COLUMNS.
2. USE AWS GLUE DATABREW FOR DATA CLEANING AND FEATURE ENGINEERING.
3. STORE THE RESULTING DATA BACK IN AMAZON S3.

Explanation

The most logical and efficient workflow begins with understanding the data's structure. AWS Glue crawlers are designed to scan data stores such as Amazon S3, automatically infer schemas from files like the described unlabeled .csv files, and populate the AWS Glue Data Catalog. Once the schema is defined, AWS Glue DataBrew provides a visual interface to clean, normalize, and perform feature engineering on the dataset. It is well suited to handling sparse, unlabeled data without writing extensive code. Finally, the prepared, clean dataset must be stored. Amazon S3 is the standard storage service for machine learning workflows on AWS, which makes the processed data readily available for model training with services like Amazon SageMaker.
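To make the "infer the schema first" step concrete, here is a minimal local sketch of what schema inference on a headerless, sparse CSV involves. It is only an illustration in plain Python and is not how AWS Glue crawlers are implemented. The placeholder column names (`column_1`, `column_2`, ...), the three type labels, and the sample data are all assumptions made for this example.

```python
import csv
import io


def infer_column_type(values):
    """Return 'bigint', 'double', or 'string' based on the populated values only."""
    populated = [v.strip() for v in values if v.strip()]
    if not populated:
        return "string"  # nothing to infer from, so fall back to the widest type
    for type_name, caster in (("bigint", int), ("double", float)):
        try:
            for v in populated:
                caster(v)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text):
    """Infer placeholder column names and types from headerless CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    width = max(len(r) for r in rows)
    # Pad short rows so that sparse files still line up column by column.
    padded = [r + [""] * (width - len(r)) for r in rows]
    return {
        f"column_{i + 1}": infer_column_type([r[i] for r in padded])
        for i in range(width)
    }


# Hypothetical sample: no header row, and several cells are empty.
sample = "1,3.5,NY\n2,,\n,4.25,LA\n"
schema = infer_schema(sample)
print(schema)
```

Empty cells are skipped during type inference, which is why sparse columns can still be typed. Cleaning and feature engineering can only start once each column has a name and a type, which is why the inference step comes first in the ordering.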

References

1. AWS Glue Crawlers: The AWS Glue Developer Guide states, "You can use a crawler to populate the AWS Glue Data Catalog with tables. This is the primary method used by most AWS Glue users." A crawler automatically discovers the schema of your data.

Source: AWS Glue Developer Guide, "Defining Crawlers".

URL: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html

2. AWS Glue DataBrew: The AWS Glue DataBrew Developer Guide explains, "AWS Glue DataBrew is a visual data preparation tool that you can use to clean and normalize data... You can then use this prepared data for analytics and machine learning."

Source: AWS Glue DataBrew Developer Guide, "What Is AWS Glue DataBrew?".

URL: https://docs.aws.amazon.com/databrew/latest/dg/what-is-databrew.html

3. Storing Data in Amazon S3: The Amazon SageMaker Developer Guide specifies that for training a model, "You store training data in an Amazon S3 bucket." This establishes S3 as the standard location for storing datasets prepared for ML training.

Source: Amazon SageMaker Developer Guide, "Train a Model with Amazon SageMaker".

URL: https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html

Q: 12

HOTSPOT

An ML engineer is working on an ML model to predict the prices of similarly sized homes. The model will base predictions on several features. The ML engineer will use the following feature engineering techniques to estimate the prices of the homes:

• Feature splitting
• Logarithmic transformation
• One-hot encoding
• Standardized distribution

Select the correct feature engineering techniques for the following list of features. Each feature engineering technique should be selected one time or not at all. (Select three.)

Discussion

Jamie V. Feb 28, 2026 8:31 pm

AWS always wants textbook preprocessing. For this, it's: CITY (NAME): ONE-HOT ENCODING TYPE_YEAR: FEATURE SPLITTING SIZE OF BUILDING: LOGARITHMIC TRANSFORMATION

Anita Y. Feb 15, 2026 4:16 pm

Had something like this in a mock, it's CITY (NAME): ONE-HOT ENCODING TYPE_YEAR: FEATURE SPLITTING SIZE OF BUILDING: LOGARITHMIC TRANSFORMATION.

Mia Feb 18, 2026 4:18 pm

I see similar logic in official practice tests. CITY (NAME): ONE-HOT ENCODING, TYPE_YEAR: FEATURE SPLITTING, SIZE OF BUILDING: LOGARITHMIC TRANSFORMATION

Jason O. Feb 25, 2026 3:39 pm

Would there be any case where you'd standardize SIZE OF BUILDING over log transform if the data wasn't skewed?

Nathan Y. Feb 18, 2026 5:49 am

CITY (NAME): ONE-HOT ENCODING, TYPE_YEAR: FEATURE SPLITTING, SIZE OF BUILDING: LOGARITHMIC TRANSFORMATION. This makes sense since city is categorical (needs one-hot), type_year needs to be split for the model, and size gets log transform because real estate features are often skewed. I think that's what AWS expects here but let me know if you see it differently.

Parker E. Feb 21, 2026 5:03 am

CITY (NAME): ONE-HOT ENCODING TYPE_YEAR: FEATURE SPLITTING SIZE OF BUILDING: LOGARITHMIC TRANSFORMATION, saw this mapping on other exam reports.

Reese Z. Feb 27, 2026 11:02 am

CITY (NAME): ONE-HOT ENCODING, TYPE_YEAR: FEATURE SPLITTING, SIZE OF BUILDING: LOGARITHMIC TRANSFORMATION. These match standard practice: categoricals get one-hot, composites get split, skewed numerics get a log transform. Pretty sure this is what AWS is after here.

Riley P. Feb 24, 2026 12:10 pm

CITY (NAME): ONE-HOT ENCODING, TYPE_YEAR: FEATURE SPLITTING, SIZE OF BUILDING: LOGARITHMIC TRANSFORMATION

Sanjay Feb 26, 2026 9:06 pm

Hmm, I’d actually go with standardized distribution for SIZE OF BUILDING instead of logarithmic transformation. Usually log is best if the data is highly skewed, but unless they specify that in the question, standardizing is common for numeric features. CITY: one-hot encoding and TYPE_YEAR: feature splitting still make sense. Not 100% sure though, maybe overlooked a clue in the image?

Owen C. Feb 20, 2026 1:25 pm

CITY (NAME): ONE-HOT ENCODING TYPE_YEAR: FEATURE SPLITTING SIZE OF BUILDING: LOGARITHMIC TRANSFORMATION. Official doc and exam guides match this mapping.

Correct Answer:

CITY (NAME): ONE-HOT ENCODING
TYPE_YEAR (TYPE OF HOME AND YEAR THE HOME WAS BUILT): FEATURE SPLITTING
SIZE OF THE BUILDING (SQUARE FEET OR SQUARE METERS): LOGARITHMIC TRANSFORMATION

Explanation

The selection of each technique aligns with standard machine learning data preprocessing practices.

City (name) is a nominal categorical feature. One-hot encoding is the appropriate method to convert these non-ordinal categories into a numerical format that an ML model can process, creating a separate binary feature for each city.

Type_year is a composite feature containing two distinct pieces of information: the home type and its construction year. Feature splitting is necessary to separate these into two independent, more useful features (type_of_home and year_built).

Size of the building is a continuous numerical feature. In real estate, features like size and price are often right-skewed. A Logarithmic transformation is used to handle this skewness, making the distribution more symmetric and helping to meet the assumptions of certain models like linear regression.
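The three techniques above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a reference implementation: the sample homes, the field names, and the underscore-delimited `type_year` format (for example `condo_2005`) are hypothetical, since the question does not show the raw values.

```python
import math


def split_type_year(value):
    """Split a composite value such as 'condo_2005' into (home_type, year_built).

    Assumes the year is the last underscore-separated token (hypothetical format).
    """
    home_type, year = value.rsplit("_", 1)
    return home_type, int(year)


def one_hot(value, categories):
    """Return a binary vector with a 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]


# Hypothetical sample rows.
homes = [
    {"city": "Austin", "type_year": "condo_2005", "size_sqft": 900},
    {"city": "Boston", "type_year": "single_family_1987", "size_sqft": 4200},
]
cities = sorted({h["city"] for h in homes})

features = []
for h in homes:
    home_type, year_built = split_type_year(h["type_year"])
    features.append({
        "city_one_hot": one_hot(h["city"], cities),  # nominal category
        "home_type": home_type,                      # from feature splitting
        "year_built": year_built,                    # from feature splitting
        "log_size": math.log(h["size_sqft"]),        # compresses a right-skewed range
    })

for row in features:
    print(row)
```

The log transform is the natural choice when size is right-skewed, as the explanation assumes. If the size values were roughly symmetric, standardizing them (subtracting the mean and dividing by the standard deviation) would be the usual alternative.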

References

One-hot encoding: Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (3rd ed.). O'Reilly Media, Inc. In Chapter 2, "Handling Text and Categorical Attributes" (pp. 71-73), the author explains and demonstrates the use of OneHotEncoder for nominal categorical features, which is the exact case for the City (name) feature.

Feature splitting: Stanford University. (n.d.). CS229: Machine Learning - Unsupervised Learning, K-means clustering. In the context of feature engineering, it's a common practice to decompose features. While not a formal named method, splitting a feature like Type_year into Type and Year is a fundamental feature engineering step to create more granular and meaningful inputs for the model. This is conceptually similar to extracting the month or day from a date feature.

Logarithmic transformation: University of California, Berkeley. (n.d.). Data 8: Foundations of Data Science, Chapter 15.2: Transformations. The course material explains that applying transformations like the logarithm is useful when dealing with skewed distributions or non-linear relationships, which is highly characteristic of housing size and price data. Applying a log transform can help linearize the relationship and stabilize variance.

Q: 13

HOTSPOT

A company wants to host an ML model on Amazon SageMaker. An ML engineer is configuring a continuous integration and continuous delivery (CI/CD) pipeline in AWS CodePipeline to deploy the model. The pipeline must run automatically when new training data for the model is uploaded to an Amazon S3 bucket. Select and order the pipeline's correct steps from the following list. Each step should be selected one time or not at all. (Select and order three.)

• An S3 event notification invokes the pipeline when new data is uploaded.
• S3 Lifecycle rule invokes the pipeline when new data is uploaded.
• SageMaker retrains the model by using the data in the S3 bucket.
• The pipeline deploys the model to a SageMaker endpoint.
• The pipeline deploys the model to SageMaker Model Registry.

Discussion

Karan J. Mar 2, 2026 9:34 am

1. An S3 event notification invokes the pipeline, 2. SageMaker retrains the model using S3 data, 3. Deploy to a SageMaker endpoint.

Ivy Z. Feb 16, 2026 3:50 am

Makes sense to kick off the pipeline with the S3 event notification, retrain with SageMaker, then deploy right to the endpoint. That's typical in these MLOps flows unless there's a versioning or review step in the requirements.

Neha S. Feb 26, 2026 7:44 pm

1. An S3 event notification invokes the pipeline when new data is uploaded
2. SageMaker retrains the model by using the data in the S3 bucket
3. The pipeline deploys the model to a SageMaker endpoint

Had something like this in a mock. This order makes sense because S3 events are used to trigger automation, then retraining, and finally deploy the fresh model to an endpoint for inference. Pretty sure this is what they want here.

Reese E. Feb 17, 2026 1:56 pm

Don't think it's Model Registry, that's the distractor here. Trigger, retrain, deploy to endpoint.

Drew J. Feb 19, 2026 3:31 am

Does this pipeline really need Model Registry if the question says nothing about approvals or versioning? I always see the sequence go S3 event triggers pipeline, then SageMaker retrains, and finally deploy to endpoint for automated inference, not Registry. Do you all agree the three steps should skip the Registry step?

Sam Feb 21, 2026 12:30 pm

S3 event notification, SageMaker retrains, then deploy to endpoint. Model Registry not needed here without versioning in requirements.

Hannah F. Feb 14, 2026 6:12 am

S3 event triggers the pipeline, retrain with SageMaker, then deploy to endpoint. That's the usual automated flow for new data hitting S3. Pretty sure this sequence is right unless they specify something about versioning.

KaranX Feb 18, 2026 11:07 am

Wow, AWS loves to bury you in their services for these pipelines. The right order is: S3 event notification triggers the pipeline when new data lands, SageMaker does the retraining, then you push the model to a SageMaker endpoint for inference. Pretty standard MLOps pattern here, unless I'm missing something sneaky in their options.

CarefulAnalyst6538 Feb 26, 2026 10:01 am

Hmm, I was thinking it's S3 event, retrain in SageMaker, then push to Model Registry.

Jason E. Feb 27, 2026 10:39 am

Looks like practice exams and the official study guide point to Model Registry as a checkpoint before endpoint deployment.

Correct Answer:

1. AN S3 EVENT NOTIFICATION INVOKES THE PIPELINE WHEN NEW DATA IS UPLOADED.
2. SAGEMAKER RETRAINS THE MODEL BY USING THE DATA IN THE S3 BUCKET.
3. THE PIPELINE DEPLOYS THE MODEL TO A SAGEMAKER ENDPOINT.

Explanation

This sequence represents a standard MLOps continuous training and deployment (CI/CD) workflow. The process is initiated by a trigger when new data arrives. An Amazon S3 event notification is the correct mechanism to detect the new data upload and invoke a downstream process like an AWS CodePipeline, typically via Amazon EventBridge. Once triggered, the pipeline's first logical action is to use the new data to retrain the model with a SageMaker training job. After successful retraining and evaluation (an implicit step), the final action is to deploy the updated model artifact to a SageMaker endpoint to make it available for real-time inference, thus completing the automated deployment cycle.
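As a rough sketch of the trigger step, the function below parses an S3 event notification payload and starts a pipeline only when an object was created. It assumes one possible wiring, in which an AWS Lambda function sits between the bucket and the pipeline. The pipeline name is hypothetical, and the `start_pipeline` callable is injected so that the example runs without AWS credentials.

```python
from urllib.parse import unquote_plus

PIPELINE_NAME = "model-retrain-deploy"  # hypothetical pipeline name


def extract_uploaded_objects(event):
    """Return (bucket, key) pairs for the object-created records in an S3 event."""
    uploads = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue  # ignore deletes, restores, and other event types
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        uploads.append((bucket, key))
    return uploads


def handle_event(event, start_pipeline):
    """Start the pipeline once if the event contains at least one new object.

    `start_pipeline` is any callable that takes a pipeline name. In a real
    Lambda function it could wrap the boto3 CodePipeline client call
    start_pipeline_execution(name=...); that wiring is an assumption here.
    """
    uploads = extract_uploaded_objects(event)
    if uploads:
        start_pipeline(PIPELINE_NAME)
    return uploads


# Hypothetical event with a single uploaded training file.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "training-data-bucket"},
                "object": {"key": "incoming/new+batch.csv"},
            },
        }
    ]
}
started = []
print(handle_event(sample_event, started.append), started)
```

Retraining and endpoint deployment would then run as stages inside the pipeline itself. An S3 Lifecycle rule cannot play this role, because lifecycle rules transition or expire objects rather than emit upload triggers.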

References

1. AWS SageMaker Developer Guide: Describes using events to automate pipelines. "You can automate the running of your pipelines based on events by using Amazon EventBridge. You can create an EventBridge rule that initiates your pipeline when, for example, the state of a resource changes, such as when a new object is added to an Amazon S3 bucket." This supports Step 1 as the trigger.

Source: AWS SageMaker Developer Guide, "Starting a Pipeline Execution".

URL: https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-automating-pipelines.html

2. AWS Whitepaper - MLOps Foundation on AWS: This document outlines the MLOps lifecycle, showing that after a trigger, the pipeline executes model building (training) and model deployment stages. "The model build pipeline is triggered automatically... The pipeline gets the latest version of the curated dataset to train the ML model... After the model is validated, the pipeline registers it to the model registry. The model deploy pipeline automatically picks up the new model... and deploys it". This supports the sequence of training (Step 2) followed by deployment (Step 3).

Source: AWS Whitepapers & Guides, "MLOps foundation on AWS", Page 10.

URL: https://d1.awsstatic.com/whitepapers/mlops-foundation-on-aws.pdf

3. AWS SageMaker Developer Guide: Explains the final deployment step. "After you train a model, you can deploy it to get predictions... Amazon SageMaker hosting services deploys your model to a SageMaker endpoint and gives you a secure and scalable endpoint that you can use for inference." This confirms that deploying to an endpoint is the action that makes the model operational for inference.

Source: AWS SageMaker Developer Guide, "Deploy a Model to SageMaker Hosting Services".

URL: https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html

Q: 14

An ML engineer is using Amazon SageMaker to train a deep learning model that requires distributed training. After some training attempts, the ML engineer observes that the instances are not performing as expected. The ML engineer identifies communication overhead between the training instances. What should the ML engineer do to MINIMIZE the communication overhead between the instances?

Options

Discussion

Layla X. Feb 13, 2026 9:56 pm

C. Keeping training instances and data in the same AZ really cuts network latency for distributed jobs. Official AWS ML guide and practice exams talk about minimizing cross-AZ traffic for this exact reason, so pretty confident here.

Avery F. Mar 4, 2026 10:40 pm

Option C is it. Keeping compute and data in the same AZ (and subnet) really reduces network latency for distributed ML jobs, which exam guides and AWS whitepapers drill on. I remember labs where any cross-AZ setup added delays fast. Pretty sure about this but open to other interpretations if someone's seen different in practice.

PracticalAuditor3101 Feb 21, 2026 6:50 am

Option C

Daniel Q. Mar 5, 2026 1:26 pm

C. not D

Daniel T. Feb 14, 2026 2:10 pm

C not D. Only C puts both compute and data in the same AZ, so network latency is lowest. Pretty sure that's what matters most for distributed training here. If someone disagrees let me know.

Jordan N. Mar 1, 2026 8:19 am

Anyone checked the official doc or tried labs for this scenario?

Jamie F. Mar 1, 2026 10:16 am

Kevin X. Feb 14, 2026 3:18 pm

Feels like C, since same AZ for both compute and data removes cross-AZ latency which can slow down distributed training. D looks tempting if you think about fault tolerance but not right if overhead is the main concern. Correct me if you see it differently.

Sam Y. Mar 2, 2026 10:42 am

A is wrong, C. If you want the lowest communication overhead for distributed training, data and compute need to be in the same AZ. That avoids cross-AZ latency and extra costs. Pretty sure C lines up with AWS ML best practices here, but let me know if you think otherwise.

Ishaan K. Feb 25, 2026 3:04 am

C tbh, seen this same scenario in AWS exam guides and labs.

Correct Answer:

Explanation

To minimize communication overhead for distributed training, compute instances and data should be as close together as possible. Placing the training instances in the same VPC subnet ensures that they reside within a single Availability Zone (AZ), because a subnet cannot span AZs. This configuration gives low network latency for inter-instance communication, which is critical for synchronizing gradients and model parameters in distributed training. Storing the training data in the same AWS Region and Availability Zone as the instances minimizes data access latency, further reducing the overall training time. General purpose Amazon S3 buckets are Regional, so AZ-level co-location applies to zonal storage such as S3 Express One Zone directory buckets. This combined strategy is the most effective way to reduce the total communication overhead.

Why Incorrect

A: Storing data in a different AWS Region introduces significant network latency for data retrieval, which would drastically increase, not minimize, overall overhead.

B: A single VPC subnet cannot span multiple Availability Zones, making this option technically invalid. Furthermore, cross-Region data storage is highly inefficient.

D: Storing data in a different Availability Zone, while better than a different Region, still incurs higher data transfer latency compared to storing it in the same AZ as the instances.

References

1. AWS VPC User Guide - Subnets: "When you create a subnet, you specify the CIDR block for the subnet and the Availability Zone in which to create the subnet. A subnet must reside entirely within one Availability Zone and cannot span zones." This confirms that placing instances in the same subnet places them in the same AZ.

Source: AWS Documentation, VPC User Guide, "Subnets for your VPC", Section: "Subnet basics".

2. AWS SageMaker Developer Guide - Best Practices: "To reduce data transfer time, we recommend that you store your data in an Amazon S3 bucket in the same AWS Region that you use for your training job." While this specifies the Region, the principle of data locality extends to the Availability Zone for optimal performance.

Source: AWS Documentation, Amazon SageMaker Developer Guide, "Best Practices for Amazon SageMaker Training", Section: "Prepare the Data".

3. AWS Documentation - Regions and Availability Zones: Availability Zones within a Region are connected with low-latency networking, but latency is lowest for resources within the same AZ. For high-performance computing (HPC) workloads like distributed training, co-locating resources in a single AZ is a standard best practice to minimize network latency.

Source: AWS Documentation, AWS Fundamentals: Core Concepts, "Regions and Availability Zones".

Q: 15

A company stores time-series data about user clicks in an Amazon S3 bucket. The raw data consists of millions of rows of user activity every day. ML engineers access the data to develop their ML models. The ML engineers need to generate daily reports and analyze click trends over the past 3 days by using Amazon Athena. The company must retain the data for 30 days before archiving the data. Which solution will provide the HIGHEST performance for data retrieval?

Options

Discussion

Casey Feb 14, 2026 1:31 pm

Option C

Sanjay K. Feb 14, 2026 9:37 pm

Option C but if Athena ever changed how it handled non-partitioned buckets that would flip this. Otherwise partitioning by date still wins.

Morgan Feb 21, 2026 5:59 am

C, pretty sure that's what Athena is optimized for. Partitioning by date prefix lets you scan just what you need, so queries are way faster than hitting everything. Splitting to buckets like D isn't really how Athena likes it. Open to debate if anyone had a different thought though.

Hannah M. Feb 17, 2026 11:41 pm

C, partitioning by date prefix is key for Athena speed. D is tempting but separate buckets don't help query performance here.

Owen R. Mar 1, 2026 1:46 pm

B or D. Both work but D looks simpler for archiving, might be missing something here.

Aaron C. Feb 21, 2026 10:59 am

C or D? D is overkill since separate S3 buckets per day makes management a pain and doesn't boost Athena performance. I think C (partition by date prefix) is best because Athena prunes partitions, so queries are way faster. Pretty sure that's what AWS recommends.

PracticalCandidate4888 Feb 28, 2026 4:14 am

I think C here. Partitioning by date prefix lets Athena scan just what it needs for recent data, way faster than scanning everything. Also lines up with the lifecycle requirement. Saw a similar question on a practice test. Makes sense?

Sara J. Mar 3, 2026 12:10 pm

Probably D, question is really clear and concise for ML/Athena use cases.

Correct Answer:

Explanation

The most critical factor for Amazon Athena query performance is minimizing the amount of data scanned. The query pattern is time-based ("past 3 days"). Partitioning the data in Amazon S3 by a date-based prefix (e.g., s3://bucket/year=YYYY/month=MM/day=DD/) allows Athena to use partition pruning. This means Athena's query engine can directly access and scan only the folders corresponding to the last three days, ignoring all other data. This dramatically reduces query latency and cost, providing the highest performance. Using S3 Lifecycle policies to transition older partitions (prefixes) to S3 Glacier Flexible Retrieval is an efficient, automated way to meet the 30-day archiving requirement.
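The effect of date-based prefixes can be sketched locally. The example below builds the `year=YYYY/month=MM/day=DD/` prefixes for a three-day window and filters a list of stored prefixes down to them, which is the same idea as partition pruning. It is an illustration only and does not call Athena or S3. Treating "past 3 days" as today plus the two previous days is an assumption, as are the function names.

```python
from datetime import date, timedelta


def partition_prefix(day):
    """Build the Hive-style date prefix used in the explanation's example path."""
    return f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def recent_prefixes(today, days=3):
    """Prefixes for `today` and the preceding days (assumed meaning of 'past 3 days')."""
    return [partition_prefix(today - timedelta(days=i)) for i in range(days)]


def prune(all_prefixes, wanted):
    """Keep only the partitions a query needs; everything else is never scanned."""
    wanted = set(wanted)
    return [p for p in all_prefixes if p in wanted]


# Hypothetical bucket layout holding 30 days of retained data.
today = date(2026, 3, 1)
stored = [partition_prefix(today - timedelta(days=i)) for i in range(30)]
to_scan = prune(stored, recent_prefixes(today))
print(f"scanning {len(to_scan)} of {len(stored)} partitions: {to_scan}")
```

With 30 days retained and a 3-day query window, only a tenth of the partitions are read. Because every day lives under its own prefix, a lifecycle rule can archive old days without affecting the layout that queries depend on.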

Why Incorrect

A: Not partitioning the data forces Athena to perform a full table scan for every query, resulting in the lowest possible performance and highest cost.

B: This option introduces unnecessary complexity with AWS Lambda functions for copying data and does not explicitly mention partitioning, which is the primary performance driver for Athena.

D: Creating a separate S3 bucket for each day is operationally complex and difficult to manage. Querying across multiple buckets is more cumbersome than querying partitions within a single table.

References

1. Amazon Athena Documentation - Partitioning Data: "Partitioning divides your table into parts and keeps the related data together based on column values such as date... Partitioning restricts the amount of data scanned by a query, thus improving performance and reducing costs."

Source: AWS Documentation, "Partitioning data in Athena".

URL: https://docs.aws.amazon.com/athena/latest/ug/partitioning-data.html

2. Amazon Athena Documentation - Performance Tuning: The number one performance tuning tip for Athena is to partition data. "By partitioning your data, you can restrict the amount of data scanned by each query, thus improving performance and reducing cost."

Source: AWS Documentation, "Top 10 performance tuning tips for Amazon Athena", Tip 1.

URL: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/

3. Amazon S3 Documentation - Lifecycle Management: S3 Lifecycle policies can be configured to act on objects based on their key prefix, which directly aligns with archiving date-based partitions. "You can create rules to define actions that you want Amazon S3 to take during an object's lifetime... The action is based on the object's age or creation date."

Source: AWS Documentation, "Managing your storage lifecycle".

URL: https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html
