By Suyash Nadkarni and Dylan Souvage — September 12, 2025 · Amazon EC2 · Best Practices · Expert (400) · Technical How-to
As organizations move more mission-critical workloads to the cloud, optimizing price-performance becomes essential. Amazon EC2 instances powered by AMD EPYC processors offer high core density, large memory bandwidth, and hardware-enabled security features, making them a strong fit for compute-, memory-, or I/O-intensive applications. This post explains how to choose the right AMD-based Amazon EC2 instance family and describes tuning techniques that improve workload efficiency—whether you run simulations, large-scale analytics, or inference jobs.
Amazon EC2 offers AMD options across multiple EPYC generations. We focus on optimization strategies for 3rd- and 4th-generation processors, which are designed for compute- and memory-heavy workloads.
Selecting the correct AMD EPYC instance begins with understanding how your application consumes compute, memory, storage, and network resources. Each family is tailored to a specific profile.
Compute-intensive workloads — large-scale numerical calculations, simulations, or encoding that require high CPU throughput and advanced instruction support.
Big Data & Analytics — large data processing and analytics that benefit from high memory bandwidth and balanced compute-to-memory ratios.
Database workloads — relational, NoSQL, or in-memory databases that need consistent memory performance and high I/O throughput.
Web and application servers — variable request workloads that need balanced compute, memory, and networking.
AI/ML on CPU — ML tasks that do not require GPUs (such as inference or preprocessing).
Matching the instance to the workload provides predictable performance and cost efficiency. Services like Amazon EC2 Auto Scaling and AWS Compute Optimizer can help with sizing and continuous scaling decisions.
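For example, once Compute Optimizer is opted in, you can pull right-sizing recommendations from the AWS CLI (the instance ARN below is a placeholder):

# Fetch right-sizing recommendations for a specific instance (placeholder ARN)
aws compute-optimizer get-ec2-instance-recommendations \
    --instance-arns arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0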
4th-generation AMD EPYC processors use a modular “chiplet” architecture. Each CPU is composed of multiple Core Complex Dies (CCD), and each CCD contains one or more core complexes (CCX). A CCX bundles up to eight physical cores; each core includes 1 MB of private L2 cache, and the eight cores share 32 MB of L3 cache. The CCDs connect to a central I/O die that manages memory and inter-chip links.
(Die diagram: Zen 4 CPU with eight cores per die)
This modular design lets instances such as m7a.24xlarge and m7a.48xlarge expose very high core counts—up to 96 physical cores per socket. For example:
m7a.24xlarge delivers 96 physical cores from a single socket.
m7a.48xlarge spans two sockets for 192 physical cores.

Understanding how EC2 instance sizes map to the underlying processor layout helps you optimize cache locality. Workloads that rely on shared-memory access or thread synchronization, such as HPC or in-memory databases, benefit from choosing sizes that minimize cross-socket communication and maximize L3-cache locality.
(Block diagram: EPYC chiplet layout)
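On two-socket sizes such as m7a.48xlarge, one way to avoid cross-socket traffic is to bind a process and its memory to a single NUMA node; a minimal sketch with numactl (assuming the numactl package is installed):

# Keep threads and memory allocations on NUMA node 0
numactl --cpunodebind=0 --membind=0 ./your_application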
4th-generation AMD EPYC instances run with SMT disabled, so each vCPU maps directly to a physical core. This eliminates resource sharing between sibling threads (execution units, caches, etc.) and can reduce intra-core interference. The result is lower jitter and more consistent throughput for HPC, ML inference, and transactional database workloads.
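You can verify the one-to-one mapping from inside the instance; with SMT disabled, lscpu reports a single thread per core:

# With SMT disabled, lscpu reports one hardware thread per core
lscpu | grep -i 'thread(s) per core'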
Tools like htop reveal CPU usage patterns, system load averages, and per-process resource consumption. Evaluate CPU utilization in the context of workload requirements: sustained utilization near 100% may indicate the workload is CPU-bound, and load averages that frequently exceed the vCPU count also signal saturation. Before resizing instances, enabling Auto Scaling, or switching families, analyze tuning opportunities that improve performance without infrastructure changes.
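For a scriptable complement to htop, standard utilities report per-vCPU utilization and load (mpstat ships in the sysstat package):

# Per-vCPU utilization sampled once per second (sysstat package)
mpstat -P ALL 1
# Compare the load averages against the number of vCPUs
uptime && nproc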
L3 cache is a fast shared cache accessible by a group of cores. On AMD-based EC2 instances, cores are grouped into L3 cache slices shared by a subset of cores on the same socket. Threads scheduled within the same slice access shared data more efficiently, reducing memory latency. On 4th-generation instances such as m7a.2xlarge or r7a.2xlarge, all vCPUs often map to cores within the same L3 slice. For larger sizes (m7a.8xlarge and above), thread pinning—assigning threads to specific physical cores—helps maintain locality and lowers performance variability.
# Restrict the application to physical cores 0-3
taskset -c 0-3 ./your_application
Use lscpu or lstopo to inspect the CPU topology and group related threads onto cores that share L3 cache.
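For example (lstopo ships in the hwloc package):

# Map each vCPU to its core, socket, and cache group
lscpu --extended=CPU,CORE,SOCKET,CACHE
# Render the full topology tree in the terminal
lstopo-no-graphics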
By default, container runtimes such as Docker allow the OS scheduler to move containers across any CPU core. That flexibility can introduce variability when containers bounce between cores that do not share cache. Pin containers to specific cores with --cpuset-cpus to improve cache efficiency and reduce jitter:
# Pin the container to physical cores 1 and 3
docker run --cpuset-cpus="1,3" my-container
Choose cores based on the CPU topology so that containers stay on cores sharing the same L3 slice.
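On multi-socket sizes, you can also keep the container's memory allocations on the matching NUMA node; a sketch using Docker's --cpuset-mems flag (the core and node numbers are illustrative):

# Pin the container's CPUs and memory allocations to NUMA node 0
docker run --cpuset-cpus="0-7" --cpuset-mems="0" my-container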
Some operating systems dynamically scale CPU frequency to save power via the CPU frequency governor. For latency-sensitive or compute-bound workloads, switch to performance mode so the CPU runs at max frequency under load:
# Switch all CPUs to the performance governor
sudo cpupower frequency-set -g performance
Benchmark against other governors (such as ondemand or schedutil) to confirm that the performance governor delivers measurable gains without excessive power usage.
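To list the governors available on the instance and verify which one is active:

# List available governors and show the active policy for CPU 0
cpupower frequency-info --governors
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor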
When compiling C/C++ applications, architecture flags such as -march=znverX enable AMD EPYC–specific optimizations (vectorization, floating-point throughput, etc.). Ensure that the compiler and target instance generation match the flag; binaries built with -march=znver4 will raise SIGILL on older instances like M5a.
| AMD EPYC generation | -march flag | Minimum GCC version | Minimum LLVM/Clang version |
|---|---|---|---|
| Generation 4 (M7a) | znver4 | GCC 12 | Clang 15 |
| Generation 3 (M6a) | znver3 | GCC 11 | Clang 13 |
| Generation 2 (M5a) | znver2 | GCC 9 | Clang 11 |
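Before hard-coding a znverX flag, you can check which -march value the installed GCC resolves for the host CPU:

# Print the -march value GCC infers for the current host
gcc -march=native -Q --help=target | grep -- '-march='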
Supported flags (see the table above for minimum compiler versions):

-march=znver4
-march=znver3
-march=znver2

4th-generation AMD EPYC instances support SIMD instruction sets such as AVX2, AVX-512, and VNNI. These can boost throughput for vector-heavy workloads (ML inference, image processing, scientific simulations). Only enable them on generations that support the instructions to avoid SIGILL errors on older hardware.
# Enable AVX2 and AVX-512F code generation; run the binary only on instances that support these instructions
gcc -mavx2 -mavx512f -O2 your_program.c -o your_program
# Investigate vectorization: report loops that failed to vectorize and why
# (-ftree-vectorizer-verbose is deprecated in recent GCC releases; -fopt-info is the replacement)
gcc -O2 -mavx2 -mavx512f -fopt-info-vec-missed your_program.c -o your_program
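If one binary must run across multiple instance generations, detect CPU features at run time instead of hard-wiring a single code path; a minimal C sketch using the GCC/Clang builtin __builtin_cpu_supports (the code paths here are illustrative placeholders):

#include <stdio.h>

int main(void) {
    /* Pick the widest SIMD path the running CPU supports, so the same
       binary runs without SIGILL on M5a, M6a, and M7a instances. */
    if (__builtin_cpu_supports("avx512f"))
        puts("dispatch: AVX-512 code path");
    else if (__builtin_cpu_supports("avx2"))
        puts("dispatch: AVX2 code path");
    else
        puts("dispatch: baseline code path");
    return 0;
}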
AMD Optimizing CPU Libraries (AOCL) provide tuned math routines—vector, scalar, RNG, FFT, BLAS, LAPACK, and more—built specifically for AMD EPYC processors. Link your application against AOCL to leverage hardware optimizations without rewriting code.
export AOCL_ROOT=/path/to/aocl
# Link against AOCL LibM (libraries listed after the source file so the linker resolves them)
gcc -I$AOCL_ROOT/include -L$AOCL_ROOT/lib your_program.c -o your_program -lamdlibm -lm
# Vector math: substitute vectorized libm calls with AOCL equivalents
# (-fveclib=AMDLIBM is a Clang/AOCC option; plain GCC does not accept -fveclib)
clang -fveclib=AMDLIBM your_program.c -o your_program -lamdlibm -lm

# Faster scalar math (-fsclrlib=AMDLIBM is specific to AMD's AOCC compiler)
clang -fsclrlib=AMDLIBM your_program.c -o your_program -lamdlibm -lamdlibmfast -lm
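As a minimal sketch, an ordinary program that calls standard math routines needs no source changes; because -lamdlibm is linked before -lm, the tuned AOCL implementations resolve first (your_program.c here is an illustrative placeholder):

/* your_program.c: standard math calls, served by AOCL LibM at link time */
#include <math.h>
#include <stdio.h>

int main(void) {
    double sum = 0.0;
    /* Hot loop over a transcendental function, where amdlibm's tuned
       sin() implementation replaces the default libm version. */
    for (int i = 0; i < 1000000; i++)
        sum += sin((double)i * 1e-6);
    printf("sum = %f\n", sum);
    return 0;
}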
# Enable AOCL's runtime profiling for the next run
export AOCL_PROFILE=1
./your_program
Profiling generates aocl_profile_report.txt, which lists call counts, execution time, and thread usage so you can focus optimization on the hottest routines.
We showed how to match AMD-based Amazon EC2 instance families to workload characteristics and how to apply tuning techniques focused on CPU utilization, thread placement, cache efficiency, and math libraries. These practices are especially valuable for CPU-bound or latency-sensitive workloads where consistent performance is critical.
Ready to get started? Sign in to the AWS Management Console and launch AMD EPYC–powered Amazon EC2 instances to begin optimizing your workloads today.
TAGS: AMD