Large language models are driving a surge in inference workloads, from simple chatbots to full agentic workflows. As inference demand grows, so does spending on GPU compute, making tokens per dollar one of the most important metrics for any deployment strategy.
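As a rough illustration, tokens per dollar follows directly from sustained throughput and the GPU's hourly rate. The sketch below shows the arithmetic; the throughput and price figures are hypothetical placeholders, not measured results.

```python
def tokens_per_dollar(throughput_tok_per_s: float, gpu_cost_per_hour: float) -> float:
    """Convert sustained decode throughput and hourly GPU cost into tokens per dollar."""
    tokens_per_hour = throughput_tok_per_s * 3600
    return tokens_per_hour / gpu_cost_per_hour

# Hypothetical example: 2,000 tok/s sustained on a GPU rented at $2.50/hour
print(f"{tokens_per_dollar(2000, 2.50):,.0f} tokens per dollar")  # -> 2,880,000
```

Anything that raises throughput at a fixed hourly price, or serves the same load on cheaper hardware, moves this number directly.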
While the AI community often gravitates toward better-known GPUs, AMD's MI300X quietly stands out. Equipped with 192 GB of HBM3 and 5.3 TB/s of memory bandwidth, it's well suited to serving large models and latency-sensitive workloads. Yet despite its raw capability, the MI300X remains largely overlooked and underutilized. In this blog post, we share our initial exploration into how targeted optimization and quantization can begin to unlock the MI300X's hardware potential. Through just two foundational optimizations, we demonstrate the platform's early promise as a cost-effective, high-throughput inference solution for open LLMs like LLaMA 3.1 8B, while revealing significant opportunities for further improvement.