A media company is developing an AI platform for video content analysis that requires storing and
processing large volumes of unstructured video data. The platform must support high throughput for
data ingestion and provide efficient access for real-time analytics. Given these requirements, which
storage strategy should the company implement?
Correct Answer: C
Options (assumed to be similar to the following):
A) A scale-up NAS solution using NFS
B) A SAN using Fibre Channel for block storage
C) A scale-out parallel file system
D) Direct-attached storage on each compute node
Explanation: The three requirements (large volumes of unstructured data, high-throughput ingestion, and efficient access for real-time analytics) are characteristic of high-performance computing (HPC) and large-scale AI workloads. A scale-out parallel file system (e.g., Lustre, IBM Spectrum Scale/GPFS) is designed specifically for this scenario. It stripes each file across multiple storage servers and disks, so many clients (compute nodes) can read and write in parallel, and aggregate bandwidth grows as storage servers are added. This architecture avoids the single-controller bottleneck of traditional NAS and provides the shared, high-performance namespace that distributed AI training and real-time analytics on large datasets depend on.
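The striping idea in the explanation can be illustrated with a minimal sketch. It assumes a simplified round-robin layout in which fixed-size stripe units are placed on storage targets in turn; real parallel file systems expose configurable layouts, and the function names and parameters here (`stripe_target`, `targets_for_read`, `stripe_size`, `num_targets`) are hypothetical, not any product's API.

```python
from collections import Counter

MIB = 1024 * 1024


def stripe_target(offset: int, stripe_size: int, num_targets: int) -> int:
    """Return the index of the storage target holding the byte at `offset`.

    Assumes a simple round-robin layout: stripe unit k lives on target
    k % num_targets.
    """
    return (offset // stripe_size) % num_targets


def targets_for_read(offset: int, length: int,
                     stripe_size: int, num_targets: int) -> Counter:
    """Count how many stripe units of one read land on each target."""
    first_unit = offset // stripe_size
    last_unit = (offset + length - 1) // stripe_size
    return Counter(unit % num_targets
                   for unit in range(first_unit, last_unit + 1))


# An 8 MiB read over 1 MiB stripe units and 4 targets touches every target,
# so all four storage servers can serve part of the request at once.
layout = targets_for_read(offset=0, length=8 * MIB,
                          stripe_size=MIB, num_targets=4)
print(dict(layout))  # {0: 2, 1: 2, 2: 2, 3: 2}
```

Because consecutive stripe units land on different targets, a single large sequential read, such as a video file, is spread across all storage servers instead of queuing behind one controller.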
Why Incorrect Options are Wrong:
A) A scale-up NAS solution using NFS: A traditional NAS controller becomes a performance bottleneck when many clients access it simultaneously, failing to meet the high-throughput requirement.
B) A SAN using Fibre Channel for block storage: SANs provide block-level access, which is not ideal for sharing large, unstructured files across many compute nodes. Concurrent shared access to the same volumes requires an additional clustered file system or volume management layer on top, which adds complexity.
D) Direct-attached storage on each compute node: This creates data silos, making it difficult to manage a large, shared dataset and requiring extensive data copying, which is inefficient for this use case.
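The bottleneck argument against option A can be made concrete with a rough back-of-the-envelope model. This is a sketch under stated assumptions: throughput is treated as the smaller of client demand and storage capacity, and every figure below is an illustrative placeholder, not a measurement or a vendor specification.

```python
def scale_up_throughput(clients: int, per_client_gbps: float,
                        controller_gbps: float) -> float:
    """Aggregate throughput when all clients share one NAS controller."""
    demand = clients * per_client_gbps
    return min(demand, controller_gbps)


def scale_out_throughput(clients: int, per_client_gbps: float,
                         servers: int, per_server_gbps: float) -> float:
    """Aggregate throughput when load is spread over many storage servers."""
    demand = clients * per_client_gbps
    return min(demand, servers * per_server_gbps)


# Illustrative numbers only (assumed): 64 clients each wanting 2 GB/s.
nas = scale_up_throughput(clients=64, per_client_gbps=2.0,
                          controller_gbps=10.0)
pfs = scale_out_throughput(clients=64, per_client_gbps=2.0,
                           servers=16, per_server_gbps=5.0)
print(nas, pfs)  # 10.0 80.0
```

In this model the scale-up system stays capped at the single controller's limit no matter how many clients are added, while the scale-out ceiling rises with each additional storage server.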
References:
1. NVIDIA. (2023). NVIDIA DGX SuperPOD Reference Architecture. This document consistently specifies high-performance, scale-out parallel file systems as the primary storage tier for AI workloads to feed the GPUs efficiently.
2. Shainer, G., & Shusterman, V. (2020). NVIDIA GPUDirect Storage: A Direct Path Between Storage and GPU Memory. This technology, central to NVIDIA's AI platform, is designed to work with parallel filesystems to maximize I/O throughput by bypassing the CPU.
3. Maltzahn, C., & Bent, J. (2017). A Survey of Distributed Storage Systems for Big Data and Scientific HPC. University of California, Santa Cruz. UCSC-SOE-17-07. This academic survey discusses the architectural advantages of parallel file systems (like Lustre, GPFS) for data-intensive scientific and analytics workloads.