1. AWS Glue Developer Guide: "AWS Glue provides a serverless environment to run your ETL jobs on a fully managed scale-out Apache Spark environment... AWS Glue crawlers scan your data stores to determine the schema for your data and then create a metadata table in your AWS Glue Data Catalog."
Source: AWS Glue Developer Guide, "What is AWS Glue?", Introduction section.
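To make the "serverless scale-out Spark" point concrete, here is a minimal sketch of the kind of request you might pass to boto3's Glue client (e.g. `boto3.client("glue").start_job_run(**run_params)`). The job name, worker type, and worker count are hypothetical placeholders, not values from the cited guide.

```python
# Sketch only: assembling a start_job_run request for a Glue Spark job.
# Glue provisions and scales the Spark workers; no servers are managed by you.

def build_job_run_params(job_name, worker_type="G.1X", num_workers=10):
    """Assemble hypothetical start_job_run parameters for a serverless Glue job."""
    return {
        "JobName": job_name,
        "WorkerType": worker_type,       # Glue-managed Spark worker size
        "NumberOfWorkers": num_workers,  # scale out without provisioning servers
    }

run_params = build_job_run_params("nightly-etl")
print(run_params["NumberOfWorkers"])  # → 10
```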
2. AWS Documentation - Choosing between AWS Glue and Amazon EMR: "AWS Glue is a good choice when your use case is ETL and you are looking for a serverless offering... If you want to avoid managing servers, you can use AWS Glue to run your Spark and Python shell workloads." This highlights Glue's lower operational overhead.
Source: AWS Big Data Blog, "Choosing between AWS Glue and Amazon EMR".
3. AWS Glue Developer Guide - Crawlers: "A crawler connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in your AWS Glue Data Catalog." This confirms its schema-detection capability for diverse sources.
Source: AWS Glue Developer Guide, "Defining Crawlers", Crawler concepts section.
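The crawler behavior above can be sketched as the payload you could pass to `boto3.client("glue").create_crawler(**crawler_params)`. Every name, role ARN, S3 path, and classifier below is a hypothetical placeholder for illustration.

```python
# Sketch of a create_crawler request: the crawler connects to the S3 target,
# tries the listed custom classifiers (in priority order) before the built-in
# ones, and writes the inferred schema as tables into the Data Catalog database.

crawler_params = {
    "Name": "sales-data-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "sales_catalog",  # Data Catalog database for metadata tables
    "Targets": {
        "S3Targets": [{"Path": "s3://example-bucket/sales/"}],
    },
    "Classifiers": ["my-custom-csv-classifier"],  # optional, prioritized
}
print(crawler_params["DatabaseName"])  # → sales_catalog
```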
4. AWS Lambda Developer Guide - Quotas: "Timeout: 900 seconds (15 minutes)". This confirms the execution-time limit that makes Lambda unsuitable for the described large-scale ETL task.
Source: AWS Lambda Developer Guide, "Lambda quotas".
5. Amazon Redshift Database Developer Guide - Redshift Spectrum: "Amazon Redshift Spectrum is a feature of Amazon Redshift that enables you to run SQL queries against exabytes of unstructured data in Amazon S3." This shows that Redshift Spectrum queries data already residing in S3; it does not perform the initial ETL that lands the data in S3.
Source: Amazon Redshift Database Developer Guide, "Getting started with Amazon Redshift Spectrum".