FSx Storage Options: Lustre for HPC, Windows for SMB, and ONTAP for Enterprise NAS

Amazon FSx provides four managed file systems: Lustre (high-performance parallel I/O), Windows File Server (SMB/AD integration), NetApp ONTAP (enterprise NAS features), and OpenZFS. Choosing correctly depends on I/O pattern, client OS, and whether you need Windows-native features.

Four FSx flavors, four use cases

Amazon FSx is not one service — it's four managed file systems with different underlying technology, protocols, and performance characteristics:

| FSx Type | Protocol | Best for | Peak throughput |
|---|---|---|---|
| FSx for Lustre | POSIX (Lustre client) | HPC, ML training, parallel I/O | Hundreds of GB/s |
| FSx for Windows File Server | SMB, NTFS | Windows apps, AD integration | Tens of GB/s |
| FSx for NetApp ONTAP | NFS, SMB, iSCSI | Enterprise NAS, multi-protocol | Tens of GB/s |
| FSx for OpenZFS | NFS | Linux workloads needing ZFS | Tens of GB/s |

EFS (NFS, POSIX, Linux) is often the right answer before reaching for FSx. FSx is for when EFS's performance or protocol support isn't sufficient.

FSx for Lustre: parallel file system for compute-intensive workloads

Lustre is a high-performance parallel file system designed for workloads that need to distribute I/O across many servers simultaneously. FSx for Lustre integrates with S3 — data can be imported from S3 on first access and exported back after processing. ML training jobs on GPU clusters are the primary use case.

Prerequisites

  • S3 object storage
  • HPC cluster architecture
  • NFS vs parallel file systems

Key Points

  • Scratch file systems: no replication, lowest cost. Data is not preserved if hardware fails, and you delete the file system yourself when the job is done. For ephemeral compute jobs.
  • Persistent file systems: data replicated within the Availability Zone and automatically repaired on hardware failure. For long-running workloads needing durability.
  • S3 data repository: link FSx to an S3 bucket. Data loads lazily on first access (or eagerly via import). Results export back to S3.
  • Striping: large files are split across multiple storage servers. Multiple workers reading different parts of the same file get full parallel throughput.
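The striping point is what makes parallel readers additive rather than contending. A rough local sketch of the access pattern (files and paths here are stand-ins for illustration — on a real mount, each slice would be served by a different Lustre storage server):

```shell
# Each worker reads its own byte range of one large file in parallel.
# On FSx for Lustre, a striped file spreads these ranges across
# multiple storage servers (OSTs), so the reads sum to full
# aggregate throughput instead of queuing on a single server.
FILE=/tmp/dataset.bin
dd if=/dev/zero of="$FILE" bs=1M count=16 status=none   # 16 MiB stand-in dataset

for worker in 0 1 2 3; do
  # Worker N reads the N-th 4 MiB slice (skip = offset in 1 MiB blocks)
  dd if="$FILE" of="/tmp/slice-$worker" bs=1M count=4 \
     skip=$((worker * 4)) status=none &
done
wait
```

The same shape — one shared file, many readers at disjoint offsets — is exactly what a data loader on a GPU cluster does against the mount point.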

FSx for Lustre: ML training pattern

The standard pattern for ML training on AWS: data in S3, FSx for Lustre as the high-speed training scratch space, EC2/SageMaker GPU instances mount via Lustre client.

# Create a persistent FSx for Lustre file system
aws fsx create-file-system \
  --file-system-type LUSTRE \
  --storage-capacity 1200 \
  --subnet-ids subnet-12345678 \
  --lustre-configuration '{
    "DeploymentType": "PERSISTENT_2",
    "PerUnitStorageThroughput": 250,
    "DataCompressionType": "LZ4"
  }'

# PERSISTENT_2 file systems link to S3 through a data repository
# association (ImportPath/ExportPath are only supported on older
# deployment types)
aws fsx create-data-repository-association \
  --file-system-id fs-12345678 \
  --file-system-path /datasets \
  --data-repository-path s3://my-training-data/datasets/ \
  --s3 '{
    "AutoImportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
    "AutoExportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]}
  }'

# Mount on EC2 instance (requires lustre-client kernel module)
sudo amazon-linux-extras install -y lustre
sudo mount -t lustre \
  -o relatime,flock \
  fs-12345678.fsx.us-east-1.amazonaws.com@tcp:/abcdef \
  /fsx

PerUnitStorageThroughput: 250 means 250 MB/s per TiB of storage. A 4.8 TiB file system provides ~1.2 GB/s of aggregate throughput. For deep learning training, this keeps GPUs fed without I/O becoming the bottleneck.
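The sizing math is linear, which makes capacity planning straightforward. A quick sketch of the arithmetic (tier and size taken from the example above):

```shell
# Aggregate throughput = storage size (TiB) x PerUnitStorageThroughput tier
per_unit=250       # MB/s per TiB (the PERSISTENT_2 tier used in this post)
storage_tib="4.8"  # hypothetical file system size
awk -v p="$per_unit" -v s="$storage_tib" \
  'BEGIN { printf "%.0f MB/s aggregate\n", p * s }'
# -> 1200 MB/s aggregate (~1.2 GB/s)
```

To go faster, you can either pick a higher throughput tier or simply provision more storage than the dataset strictly needs.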

An auto-import policy covering new, changed, and deleted objects keeps the FSx file system synchronized with S3 — new training data added to the bucket becomes visible at the mount point without a manual re-import.

FSx for Windows File Server: SMB and Active Directory

FSx for Windows File Server provides a native Windows file system with full NTFS, SMB 2.x/3.x, and Active Directory integration. It's the migration path for Windows file servers that need to move to AWS.

resource "aws_fsx_windows_file_system" "main" {
  storage_capacity    = 300  # GB
  subnet_ids          = [aws_subnet.private_az1.id, aws_subnet.private_az2.id]
  throughput_capacity = 64   # MB/s

  # Join a self-managed AD domain. (Alternatively, set active_directory_id
  # to use AWS Managed Microsoft AD -- the two options are mutually exclusive.)
  self_managed_active_directory {
    dns_ips     = ["10.0.1.10", "10.0.1.20"]
    domain_name = "corp.example.com"
    username    = "admin"
    password    = var.ad_password
  }

  deployment_type     = "MULTI_AZ_1"  # HA across two AZs
  preferred_subnet_id = aws_subnet.private_az1.id

  aliases = ["fileserver.corp.example.com"]
}

Key capabilities: DFS Namespaces (a single logical namespace spanning multiple file systems), Windows ACLs and NTFS permissions, Shadow Copies (VSS-based snapshots users can browse as "previous versions"), and SMB encryption in transit.

FSx for NetApp ONTAP: multi-protocol enterprise storage

FSx for NetApp ONTAP is the most feature-rich FSx option. It supports NFS, SMB, and iSCSI simultaneously on the same file system — a single volume can be mounted by Linux (NFS), Windows (SMB), and block storage clients (iSCSI) concurrently.

Key enterprise features not available in EFS or other FSx options:

FlexClone: instant, space-efficient clones of volumes or individual files. A clone of a 10TB volume uses no additional storage until data diverges. Used for dev/test environments — clone production data instantly, let developers modify their copy without affecting production.

SnapMirror: asynchronous replication to another FSx for ONTAP file system in a different region. Used for DR: if primary region fails, promote the secondary ONTAP file system.

Tiering: automatically moves infrequently-accessed data to S3 (capacity pool tier). Hot data stays on SSD; cold data moves to S3-backed cheap storage transparently.

resource "aws_fsx_ontap_file_system" "main" {
  storage_capacity    = 1024  # GB
  subnet_ids          = [aws_subnet.private_az1.id, aws_subnet.private_az2.id]
  preferred_subnet_id = aws_subnet.private_az1.id
  deployment_type     = "MULTI_AZ_1"
  throughput_capacity = 256   # MB/s

  fsx_admin_password = var.fsx_admin_password
}
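The resource above only creates the ONTAP file system itself; data lives in volumes inside a storage virtual machine (SVM). A hedged sketch of how the S3-backed tiering described earlier is enabled per volume (resource names here are illustrative, not from the original):

```hcl
resource "aws_fsx_ontap_storage_virtual_machine" "svm" {
  file_system_id = aws_fsx_ontap_file_system.main.id
  name           = "svm1"
}

resource "aws_fsx_ontap_volume" "data" {
  name                       = "data"
  junction_path              = "/data"
  size_in_megabytes          = 102400  # 100 GB
  storage_virtual_machine_id = aws_fsx_ontap_storage_virtual_machine.svm.id

  # Move blocks not read for ~31 days to the S3-backed capacity pool tier
  tiering_policy {
    name           = "AUTO"
    cooling_period = 31
  }
}
```

With the AUTO policy, hot blocks stay on SSD while cold blocks drain to the capacity pool transparently; clients see one volume regardless of where blocks physically live.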

ONTAP is appropriate when: migrating an existing NetApp NAS to AWS, needing SMB and NFS on the same data, requiring enterprise data management features (FlexClone, SnapMirror), or running Oracle databases over NFS (ONTAP is a long-validated NFS target for Oracle).

EFS vs FSx: the decision

Choose EFS when:

  • Linux workloads, POSIX file system, NFS
  • Multiple instances need concurrent read-write access with automatic scaling
  • Elastic throughput (Elastic mode) handles variable workload
  • Operational simplicity matters over maximum performance

Choose FSx for Lustre when:

  • HPC, ML training, financial modeling — workloads needing hundreds of GB/s
  • Need S3 integration (lazy loading, export back)
  • Scratch compute clusters that start/stop — scratch FSx deletes automatically

Choose FSx for Windows when:

  • Windows applications requiring SMB, NTFS, or Windows ACLs
  • Active Directory integration required
  • Migrating on-premises Windows file servers

Choose FSx for ONTAP when:

  • Multi-protocol access (NFS + SMB on same data)
  • Enterprise data management (FlexClone, SnapMirror)
  • Existing NetApp infrastructure being extended to AWS

A genomics research team runs a pipeline that reads a 5TB reference genome dataset on 500 EC2 GPU instances simultaneously. Each instance reads different parts of the dataset in parallel. They're currently using S3 with the boto3 SDK but experiencing I/O bottlenecks. Which FSx type would best address their bottleneck?

The dataset is 5TB and stored in S3. 500 GPU instances need concurrent read access. The bottleneck is I/O throughput to the training data. The workload runs for 12-hour bursts, then terminates.

  • A) FSx for Windows File Server — high throughput and Windows-compatible
    Incorrect. FSx for Windows requires SMB clients. GPU EC2 instances for ML run Linux, and Windows File Server is not designed for parallel HPC workloads.
  • B) FSx for Lustre with S3 data repository link — designed for exactly this parallel I/O pattern, with lazy loading from S3 on first access
    Correct! FSx for Lustre is a parallel file system specifically designed for many clients reading different parts of large files simultaneously — the core pattern of ML training. The S3 data repository integration means the 5TB dataset loads lazily on first access (or can be pre-imported). A scratch deployment type keeps costs down for the 12-hour bursts, and the file system can be deleted when the job ends. EFS would provide NFS but can't deliver the hundreds of GB/s that 500 concurrent readers require.
  • C) EFS with Max I/O performance mode — scales to thousands of clients
    Incorrect. EFS Max I/O provides higher aggregate throughput but adds latency, and EFS's throughput ceiling (multiple GB/s at most) is far below what 500 clients reading a 5TB dataset simultaneously need from a high-performance parallel file system.
  • D) FSx for NetApp ONTAP — enterprise NAS with multi-protocol support
    Incorrect. ONTAP is excellent for enterprise NAS and multi-protocol access but isn't optimized for the parallel HPC I/O pattern. Lustre is specifically designed for this workload.

Hint: Which file system protocol is designed for parallel HPC workloads where many clients read different portions of large files simultaneously?