FSx Storage Options: Lustre for HPC, Windows for SMB, and ONTAP for Enterprise NAS

Amazon FSx provides four managed file systems: Lustre (high-performance parallel I/O), Windows File Server (SMB/AD integration), NetApp ONTAP (enterprise NAS features), and OpenZFS. Choosing correctly depends on I/O pattern, client OS, and whether you need Windows-native features.

Four FSx flavors, four use cases

Amazon FSx is not one service — it's four managed file systems with different underlying technology, protocols, and performance characteristics:

| FSx Type | Protocol | Best for | Peak throughput |
|---|---|---|---|
| FSx for Lustre | POSIX (Lustre client) | HPC, ML training, parallel I/O | Hundreds of GB/s |
| FSx for Windows File Server | SMB, NTFS | Windows apps, AD integration | Tens of GB/s |
| FSx for NetApp ONTAP | NFS, SMB, iSCSI | Enterprise NAS, multi-protocol | Tens of GB/s |
| FSx for OpenZFS | NFS | Linux workloads needing ZFS | Tens of GB/s |

EFS (NFS, POSIX, Linux) is often the right answer before reaching for FSx. FSx is for when EFS's performance or protocol support isn't sufficient.

FSx for Lustre: parallel file system for compute-intensive workloads

Lustre is a high-performance parallel file system designed for workloads that need to distribute I/O across many servers simultaneously. FSx for Lustre integrates with S3 — data can be imported from S3 on first access and exported back after processing. ML training jobs on GPU clusters are the primary use case.

Prerequisites

  • S3 object storage
  • HPC cluster architecture
  • NFS vs parallel file systems

Key Points

  • Scratch file systems: no replication, lowest cost. Data is not preserved if hardware fails, and you delete the file system yourself when the job is done. For ephemeral compute jobs.
  • Persistent file systems: data replicated within the Availability Zone and automatically repaired on hardware failure. For long-running workloads needing durability.
  • S3 data repository: link FSx to an S3 bucket. Data loads lazily on first access (or eagerly via import). Results export back to S3.
  • Striping: large files are split across multiple storage servers. Multiple workers reading different parts of the same file get full parallel throughput.
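The striping point is what makes parallel readers additive rather than contending. A rough local sketch of the access pattern (files and paths here are stand-ins for illustration — on a real mount, each slice would be served by a different Lustre storage server):

```shell
# Each worker reads its own byte range of one large file in parallel.
# On FSx for Lustre, a striped file spreads these ranges across
# multiple storage servers (OSTs), so the reads sum to full
# aggregate throughput instead of queuing on a single server.
FILE=/tmp/dataset.bin
dd if=/dev/zero of="$FILE" bs=1M count=16 status=none   # 16 MiB stand-in dataset

for worker in 0 1 2 3; do
  # Worker N reads the N-th 4 MiB slice (skip = offset in 1 MiB blocks)
  dd if="$FILE" of="/tmp/slice-$worker" bs=1M count=4 \
     skip=$((worker * 4)) status=none &
done
wait
```

The same shape — one shared file, many readers at disjoint offsets — is exactly what a data loader on a GPU cluster does against the mount point.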

FSx for Lustre: ML training pattern

The standard pattern for ML training on AWS: data in S3, FSx for Lustre as the high-speed training scratch space, EC2/SageMaker GPU instances mount via Lustre client.

# Create a persistent FSx for Lustre file system
aws fsx create-file-system \
  --file-system-type LUSTRE \
  --storage-capacity 1200 \
  --subnet-ids subnet-12345678 \
  --lustre-configuration '{
    "DeploymentType": "PERSISTENT_2",
    "PerUnitStorageThroughput": 250,
    "DataCompressionType": "LZ4"
  }'

# PERSISTENT_2 file systems link to S3 through a data repository
# association (ImportPath/ExportPath are only supported on older
# deployment types)
aws fsx create-data-repository-association \
  --file-system-id fs-12345678 \
  --file-system-path /datasets \
  --data-repository-path s3://my-training-data/datasets/ \
  --s3 '{
    "AutoImportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
    "AutoExportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]}
  }'

# Mount on EC2 instance (requires lustre-client kernel module)
sudo amazon-linux-extras install -y lustre
sudo mount -t lustre \
  -o relatime,flock \
  fs-12345678.fsx.us-east-1.amazonaws.com@tcp:/abcdef \
  /fsx

PerUnitStorageThroughput: 250 means 250 MB/s per TiB of storage. A 4.8 TiB file system provides ~1.2 GB/s of aggregate throughput. For deep learning training, this keeps GPUs fed without I/O becoming the bottleneck.
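The sizing math is linear, which makes capacity planning straightforward. A quick sketch of the arithmetic (tier and size taken from the example above):

```shell
# Aggregate throughput = storage size (TiB) x PerUnitStorageThroughput tier
per_unit=250       # MB/s per TiB (the PERSISTENT_2 tier used in this post)
storage_tib="4.8"  # hypothetical file system size
awk -v p="$per_unit" -v s="$storage_tib" \
  'BEGIN { printf "%.0f MB/s aggregate\n", p * s }'
# -> 1200 MB/s aggregate (~1.2 GB/s)
```

To go faster, you can either pick a higher throughput tier or simply provision more storage than the dataset strictly needs.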

An auto-import policy covering new, changed, and deleted objects keeps the FSx file system synchronized with S3 — new training data added to the bucket becomes visible at the mount point without a manual re-import.

FSx for Windows File Server: SMB and Active Directory

FSx for Windows File Server provides a native Windows file system with full NTFS, SMB 2.x/3.x, and Active Directory integration. It's the migration path for Windows file servers that need to move to AWS.

resource "aws_fsx_windows_file_system" "main" {
  storage_capacity    = 300  # GB
  subnet_ids          = [aws_subnet.private_az1.id, aws_subnet.private_az2.id]
  throughput_capacity = 64   # MB/s

  # Join a self-managed AD domain. (Alternatively, set active_directory_id
  # to use AWS Managed Microsoft AD -- the two options are mutually exclusive.)
  self_managed_active_directory {
    dns_ips     = ["10.0.1.10", "10.0.1.20"]
    domain_name = "corp.example.com"
    username    = "admin"
    password    = var.ad_password
  }

  deployment_type     = "MULTI_AZ_1"  # HA across two AZs
  preferred_subnet_id = aws_subnet.private_az1.id

  aliases = ["fileserver.corp.example.com"]
}

Key capabilities: DFS Namespaces (a single logical namespace spanning multiple file systems), Windows ACLs and NTFS permissions, Shadow Copies (VSS-based snapshots users can browse as "previous versions"), and SMB encryption in transit.

FSx for NetApp ONTAP: multi-protocol enterprise storage

FSx for NetApp ONTAP is the most feature-rich FSx option. It supports NFS, SMB, and iSCSI simultaneously on the same file system — a single volume can be mounted by Linux (NFS), Windows (SMB), and block storage clients (iSCSI) concurrently.

Key enterprise features not available in EFS or other FSx options:

FlexClone: instant, space-efficient clones of volumes or individual files. A clone of a 10TB volume uses no additional storage until data diverges. Used for dev/test environments — clone production data instantly, let developers modify their copy without affecting production.

SnapMirror: asynchronous replication to another FSx for ONTAP file system in a different region. Used for DR: if primary region fails, promote the secondary ONTAP file system.

Tiering: automatically moves infrequently-accessed data to S3 (capacity pool tier). Hot data stays on SSD; cold data moves to S3-backed cheap storage transparently.

resource "aws_fsx_ontap_file_system" "main" {
  storage_capacity    = 1024  # GB
  subnet_ids          = [aws_subnet.private_az1.id, aws_subnet.private_az2.id]
  preferred_subnet_id = aws_subnet.private_az1.id
  deployment_type     = "MULTI_AZ_1"
  throughput_capacity = 256   # MB/s

  fsx_admin_password = var.fsx_admin_password
}
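The resource above only creates the ONTAP file system itself; data lives in volumes inside a storage virtual machine (SVM). A hedged sketch of how the S3-backed tiering described earlier is enabled per volume (resource names here are illustrative, not from the original):

```hcl
resource "aws_fsx_ontap_storage_virtual_machine" "svm" {
  file_system_id = aws_fsx_ontap_file_system.main.id
  name           = "svm1"
}

resource "aws_fsx_ontap_volume" "data" {
  name                       = "data"
  junction_path              = "/data"
  size_in_megabytes          = 102400  # 100 GB
  storage_virtual_machine_id = aws_fsx_ontap_storage_virtual_machine.svm.id

  # Move blocks not read for ~31 days to the S3-backed capacity pool tier
  tiering_policy {
    name           = "AUTO"
    cooling_period = 31
  }
}
```

With the AUTO policy, hot blocks stay on SSD while cold blocks drain to the capacity pool transparently; clients see one volume regardless of where blocks physically live.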

ONTAP is appropriate when: migrating an existing NetApp NAS to AWS, needing SMB and NFS on the same data, requiring enterprise data management features (FlexClone, SnapMirror), or running Oracle databases over NFS (ONTAP is a long-validated NFS target for Oracle).

EFS vs FSx: the decision

Choose EFS when:

  • Linux workloads, POSIX file system, NFS
  • Multiple instances need concurrent read-write access with automatic scaling
  • Elastic throughput (Elastic mode) handles variable workload
  • Operational simplicity matters over maximum performance

Choose FSx for Lustre when:

  • HPC, ML training, financial modeling — workloads needing hundreds of GB/s
  • Need S3 integration (lazy loading, export back)
  • Scratch compute clusters that start/stop — scratch FSx deletes automatically

Choose FSx for Windows when:

  • Windows applications requiring SMB, NTFS, or Windows ACLs
  • Active Directory integration required
  • Migrating on-premises Windows file servers

Choose FSx for ONTAP when:

  • Multi-protocol access (NFS + SMB on same data)
  • Enterprise data management (FlexClone, SnapMirror)
  • Existing NetApp infrastructure being extended to AWS

A genomics research team runs a pipeline that reads a 5TB reference genome dataset on 500 EC2 GPU instances simultaneously. Each instance reads different parts of the dataset in parallel. They're currently using S3 with the boto3 SDK but experiencing I/O bottlenecks. Which FSx type would best address their bottleneck?

The dataset is 5TB and stored in S3. 500 GPU instances need concurrent read access. The bottleneck is I/O throughput to the training data. The workload runs for 12-hour bursts, then terminates.

  • A) FSx for Windows File Server — high throughput and Windows-compatible
    Incorrect. FSx for Windows requires SMB clients. GPU EC2 instances for ML run Linux, and Windows File Server is not designed for parallel HPC workloads.
  • B) FSx for Lustre with S3 data repository link — designed for exactly this parallel I/O pattern, with lazy loading from S3 on first access
    Correct! FSx for Lustre is a parallel file system specifically designed for many clients reading different parts of large files simultaneously — the core pattern of ML training. The S3 data repository integration means the 5TB dataset loads lazily on first access (or can be pre-imported). A scratch deployment type keeps costs down for the 12-hour bursts, and the file system can be deleted when the job ends. EFS would provide NFS but can't deliver the hundreds of GB/s that 500 concurrent readers require.
  • C) EFS with Max I/O performance mode — scales to thousands of clients
    Incorrect. EFS Max I/O provides higher aggregate throughput but adds latency, and EFS's throughput ceiling (multiple GB/s at most) is far below what 500 clients reading a 5TB dataset simultaneously need from a high-performance parallel file system.
  • D) FSx for NetApp ONTAP — enterprise NAS with multi-protocol support
    Incorrect. ONTAP is excellent for enterprise NAS and multi-protocol access but isn't optimized for the parallel HPC I/O pattern. Lustre is specifically designed for this workload.

Hint: Which file system protocol is designed for parallel HPC workloads where many clients read different portions of large files simultaneously?