Blog Posts
Trace Compare - Compare vLLM traces across platforms

Trace Compare: Compare vLLM traces across platforms

Get accurate 1:1 kernel mappings across hardware providers. Compare large vLLM traces in seconds with clean prefill vs. decode separation.

Wafer Workspaces - GPU compute for coding agents

Workspaces: GPU Compute for Your Coding Agent

Give your AI coding assistant direct access to GPUs. No manual SSH setup, no Docker, no infrastructure management.

CUDA Compiler Explorer in VS Code

Cloud Compiler Analyzer (PTX/SASS) Inside Your IDE

Cloud CUDA compilation with PTX/SASS output, PyTorch headers, and VS Code integration. No local CUDA install required.

Nordlys Labs case study - 8x faster CUDA kernel optimization

Nordlys Labs: 8x Faster Routing with Wafer-Guided Kernel Optimization

How a non-kernel-expert achieved 8x speedup on latency-critical CUDA clustering code using profile-guided optimization with Wafer.

Profile-guided GPU kernel optimization with ncu

Profile-Guided GPU Kernel Optimization

How adding profiling tools to our CLI helped an agent break through a theory-based optimization plateau, achieving 11.65x speedup on the Kimi Delta Attention kernel.

The year of the LLM GPU kernel engineer

The Year of the LLM GPU Kernel Engineer

We used an AI agent to optimize AMD's topk_sigmoid kernel, achieving a 9x speedup over PyTorch. Here's exactly how our agent did it.

Reward hacking in LLM-generated kernels

Case Study: A 104x (?) Speedup on KernelBench

How a fused kernel claiming 104x speedup passed our correctness checks while reading garbage memory, and the determinism check that catches it.

Water lilies painting representing HIP kernel optimization

Which models are the most HIP?

LLM-generated kernels are all the rage right now. We used frontier AI models to write HIP kernels for KernelBench and ran them on MI300Xs. Which ones performed the best?

wafer-ai CLI - GPU Superpowers for Your Coding Agent

wafer-ai CLI: GPU Superpowers for Your Coding Agent

Give your AI coding assistant direct access to GPU documentation, trace analysis, and remote kernel evaluation with the wafer-ai CLI.

GPU Docs Web App

GPU Docs: Now Available on the Web

The GPU documentation tool that thousands of engineers loved in our IDE extension is now available as a standalone web app.

ROCprofiler Compute in VS Code showing GPU architecture diagram

Introducing ROCprofiler Compute: AMD GPU Profiling in Your IDE

Profile AMD GPUs directly in VS Code and Cursor. View hardware metrics, roofline analysis, and kernel stats — all without leaving your editor.

Wafer Perfetto Trace Viewer in VS Code

Introducing Wafer's Built-in Perfetto Trace Viewer

Open Chrome trace JSON files directly in your IDE with full Perfetto functionality — timeline, flamegraphs, SQL, and metrics.

Wafer Extension - Your GPU Development Stack

Introducing the Wafer Extension for VS Code and Cursor

Wafer is the GPU development stack that lives inside your editor: profiling (NCU), compiler explorer, and enhanced GPU docs.

Chip Benchmark visualization showing hardware performance comparison

Introducing Chip Benchmark: Hardware-Centric Performance Insights for AI Workloads

As the AI hardware ecosystem rapidly expands, choosing the right accelerator has become increasingly complex. We're excited to introduce Chip Benchmark, an open-source benchmarking suite purpose-built to evaluate the performance of open-weight LLMs across diverse hardware platforms.

AMD MI300X optimization visualization showing performance improvements

Unlocking AMD MI300X for High-Throughput, Low-Cost LLM Inference

Large language models are driving a surge in inference workloads. While the AI community often gravitates toward more well-known GPUs, AMD's MI300X quietly stands out, equipped with 192 GB of HBM3 and 5.3 TB/s of memory bandwidth. We explore how targeted optimization and quantization can unlock its potential.