
Unlocking AMD MI300X for High-Throughput, Low-Cost LLM Inference
Introduction
Large language models are driving a surge in inference workloads - from simple chatbots to full agentic workflows. As inference demands grow, so do the dollars spent on GPU compute, making tokens per dollar one of the most crucial metrics for any deployment strategy.
While the AI community often gravitates toward more well-known GPUs, AMD's MI300X quietly stands out. Equipped with 192 GB of HBM3 and 5.3 TB/s of memory bandwidth, it's engineered for large-model or latency-sensitive inference tasks. Yet, despite its raw capability, MI300X remains largely underutilized and overlooked. In this blog post, we share our initial exploration into how targeted optimization and quantization can begin to unlock MI300X's promising hardware. Through just two foundational optimizations, we demonstrate the platform's early potential to become a cost-effective, high-throughput inference solution for open LLMs like LLaMA 3.1 8B - while revealing significant opportunities for further advancement.
Problem
The main challenge lies not in the MI300X hardware, but in the relative maturity of the supporting software ecosystem. Most LLM inference frameworks have benefited from years of deep, low-level optimization for CUDA-based systems. Techniques like kernel fusion, memory management, scheduling, and quantization are all tailored to maximize performance on NVIDIA GPUs. By comparison, ROCm (AMD's counterpart to CUDA) is newer to the scene and still building out similar support. As a result, features such as 192 GB of HBM3 and industry-leading memory bandwidth may not be fully leveraged by standard, off-the-shelf software. This difference in software maturity can create the impression that MI300X is less capable, when in fact it simply hasn't seen the same level of software optimization and tuning.
Rich memory and raw compute are wasted if the software isn't tuned to use them. In essence, MI300X is a sleeping powerhouse, held back only by a lack of ecosystem support.
Solution
To begin unlocking the potential of the MI300X for large language model inference, we started with two foundational optimizations using vLLM in the Docker container `rocm/vllm-dev:nightly_0624_rc2_0624_rc2_20250620`, which includes a nightly build of vLLM:
Optimization 1: Custom Kernels + HIP Graph Optimization
- Custom MI300X GPU kernels: Developed kernels tailored specifically for MI300X architecture to optimize key operations like attention and GEMM
- HIP graph optimization: Enabled `full_cuda_graph: true` in the vLLM configuration, which on MI300X translates to HIP graph functionality through ROCm. This caches and reuses entire GPU execution graphs as directed acyclic graphs (DAGs), minimizing CPU overhead and launch latency for concurrent inference streams
Technical Note: While vLLM uses CUDA-style naming (`full_cuda_graph: true`) for framework compatibility, on AMD MI300X hardware this actually leverages ROCm's HIP Graph capabilities, as in the sketch below.
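For concreteness, here is a minimal sketch of what enabling full-graph capture looks like through vLLM's offline Python API. The checkpoint name and the dict form of `compilation_config` are our assumptions for illustration, not the exact invocation behind our numbers:

```python
# Minimal sketch: enabling full-graph capture in vLLM's offline API.
# Assumed: the checkpoint name and the dict form of compilation_config.
# The flag keeps its CUDA-style name, but on MI300X/ROCm the capture
# and replay are performed with HIP Graphs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed checkpoint
    compilation_config={"full_cuda_graph": True},
)

params = SamplingParams(temperature=0.0, max_tokens=64)
out = llm.generate(["What does HIP Graph capture buy us?"], params)
print(out[0].outputs[0].text)
```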
Optimization 2: FP8 Quantization
- Applied FP8 quantization on top of the custom kernel and HIP graph optimizations (see the sketch below)
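As a sketch of how this layers on top of Optimization 1 (again via vLLM's offline API, with the checkpoint name assumed), dynamic FP8 quantization can be requested at model load time alongside the graph-capture setting:

```python
# Minimal sketch: FP8 on top of the graph-capture configuration.
# quantization="fp8" asks vLLM to cast weights to FP8 at load time
# (dynamic scheme); a pre-quantized FP8 checkpoint would also work.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed checkpoint
    quantization="fp8",
    compilation_config={"full_cuda_graph": True},
)
```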
We conducted an initial performance evaluation with 200 concurrent inference streams on LLaMA 3.1 8B, using three representative I/O configurations (input/output tokens; a reproduction sketch follows the list):
- Short-input/long-output: 500/2000
- Same-input/same-output: 1000/1000
- Long-input/short-output: 5000/500
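The sketch below shows one way to approximate these workload shapes offline with synthetic token-ID prompts. It is an illustration under assumed names (checkpoint, prompt construction), not the exact benchmark harness behind our reported numbers:

```python
# Minimal sketch: approximating the three I/O shapes with synthetic
# prompts in vLLM's offline API. Assumed: checkpoint name and the use
# of a repeated token ID as a stand-in for real prompt content.
import time
from vllm import LLM, SamplingParams

WORKLOADS = {                      # (input_tokens, output_tokens)
    "short-in/long-out": (500, 2000),
    "balanced":          (1000, 1000),
    "long-in/short-out": (5000, 500),
}
CONCURRENCY = 200                  # concurrent inference streams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed checkpoint

for name, (in_len, out_len) in WORKLOADS.items():
    # A synthetic prompt of in_len arbitrary-but-valid token IDs.
    prompt = {"prompt_token_ids": [1] * in_len}
    params = SamplingParams(max_tokens=out_len, ignore_eos=True)
    start = time.perf_counter()
    outs = llm.generate([prompt] * CONCURRENCY, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outs)
    print(f"{name}: ~{generated / elapsed:,.0f} output tokens/sec")
```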
The results for these optimizations are shown below against the baseline BF16 implementation.
We show below that quantization did not cause a meaningful reduction in accuracy, dropping only two points of exact match (0.49 to 0.47) on the GSM8K dataset evaluated at pass@5.
Accuracy
| Setup | Exact Match |
|---|---|
| Baseline | 0.49 |
| Custom Kernels | 0.49 |
| Custom Kernels + FP8 | 0.47 |
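A hedged sketch of how such a check can be reproduced with the lm-evaluation-harness Python API follows. The model arguments are our assumptions, and the harness's default gsm8k task reports greedy exact match rather than pass@5, so sampling settings would need adjusting to mirror our setup exactly:

```python
# Minimal sketch: GSM8K exact match via lm-evaluation-harness with a
# vLLM backend. Assumed: model args. The default gsm8k task is greedy
# exact match, not pass@5, so treat this as an approximation.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=meta-llama/Llama-3.1-8B-Instruct,"
        "quantization=fp8"
    ),
    tasks=["gsm8k"],
)
print(results["results"]["gsm8k"])
```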
Comparison to NVIDIA GPUs
To evaluate our initial optimizations, we compared our early MI300X results against NVIDIA's H100 performance using data from NVIDIA NIM LLM Benchmarking (Last updated on Jun 26, 2025). Our preliminary analysis reveals two key technology comparisons that hint at MI300X's emerging potential in the inference landscape.
Technology Comparison 1: FP8 Performance Across Workloads
This comparison examines FP8 quantization performance across different input/output ratios between our Optimization 2 implementation and NVIDIA's. The analysis covers the three representative workloads: short-input/long-output (500/2000), balanced (1000/1000), and long-input/short-output (5000/500). While NVIDIA FP8 achieves higher absolute throughput across all scenarios, Optimization 2 consistently delivers superior cost efficiency.
Technology Comparison 2: Complete Platform Analysis
We conducted a comprehensive head-to-head comparison of all implementations at a concurrency of 200 for the 1000/1000 workload, testing four configurations:
- Optimization 1: 6,150 tokens/sec
- Optimization 2: 7,353 tokens/sec
- NVIDIA (BF16): 7,425 tokens/sec
- NVIDIA (FP8): 11,553 tokens/sec
Performance and Economic Overview
Using comprehensive cloud pricing data across 10 major providers, we calculated the cost efficiency for each platform:
| Setup | Tokens/sec | Cost/hr* | Tokens per Dollar |
|---|---|---|---|
| H100 (BF16) | 7,425 | $4.99 | 1,488 |
| H100 (FP8) | 11,553 | $4.99 | 2,315 |
| MI300X (Optimization 1) | 6,150 | $1.99 | 3,090 |
| MI300X (Optimization 2) | 7,353 | $1.99 | 3,695 |
*Average H100 pricing across 10 major cloud providers (source 1, source 2)
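The last column is straightforward to reproduce. As a sketch (inputs are the table's own numbers): the published figure is throughput divided by hourly price, so it is proportional to true tokens per dollar; multiplying by 3,600 gives the absolute count of tokens generated per dollar spent.

```python
# Minimal sketch: reproducing the "Tokens per Dollar" column.
# The published figure is tokens/sec divided by $/hr; multiply by
# 3600 for absolute tokens generated per dollar spent.
SETUPS = {                               # (tokens/sec, $/hr)
    "H100 (BF16)":             (7_425, 4.99),
    "H100 (FP8)":              (11_553, 4.99),
    "MI300X (Optimization 1)": (6_150, 1.99),
    "MI300X (Optimization 2)": (7_353, 1.99),
}

for name, (tps, price) in SETUPS.items():
    print(f"{name}: {tps / price:,.0f} tokens/sec per $/hr")
```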
Early findings: NVIDIA FP8 achieves higher absolute throughput, but at a higher operational cost. Even with only these initial optimizations, when comparing cost efficiency (tokens per dollar), MI300X Optimization 2 delivers:
- 148% more tokens per dollar than H100 (BF16) (3,695 vs. 1,488)
- 60% more tokens per dollar than H100 (FP8) (3,695 vs. 2,315)
These results suggest the MI300X platform has strong potential for cost-sensitive deployments, even before exploring more advanced optimizations.
What's Next
With just two foundational steps - custom kernels with HIP graph capture, and FP8 quantization - we've started to unlock the MI300X's potential for LLaMA 3.1 8B inference. Already, MI300X is showing:
- Competitive throughput: Approaching H100-based solutions
- Excellent cost efficiency: 60% better value than H100 (FP8) at 3,695 tokens per dollar
- Plenty of room to grow: We're only scratching the surface with a suite of new optimizations coming soon.
We're just getting started. There are many ways to push MI300X further - better profiling tools, more efficient kernels, better quantization, improved memory use, and more.
The Bigger Picture
This is proof that with the right software, new hardware can drive a more affordable, diverse AI ecosystem. Stay tuned for more.
Want to help push GPUs further?
Connect with us: