CUDA Agent Paper Review: Teaching LLMs to Write Fast GPU Kernels via RL

baka_mashiroAbout 4 min

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
ByteDance Seed + Tsinghua AIR (SIA-Lab), 2026
cuda-agent.github.io

Writing fast GPU kernels is genuinely hard. You need to understand memory hierarchy, warp scheduling, bank conflicts, tensor core layouts, and about fifty other microarchitectural details that change between GPU generations. Most engineers — including most ML engineers — don't have this knowledge. They use libraries (cuBLAS, cuDNN, FlashAttention) and hope for the best.

CUDA Agent is an attempt to teach an LLM to do this optimization work through reinforcement learning. The headline result: 2.11x average speedup over torch.compile on KernelBench, with 96.8% of generated kernels being faster than the compiler baseline. That's a striking number. Let me explain what they actually built, and then explain why I think the comparison is partially misleading — while also arguing that the core conclusion is still valid.

What They Built

Three Core Components

1. Scalable Data Synthesis (6K Training Ops)

Before any RL, you need training data. The team synthesized approximately 6,000 CUDA operator implementations across different categories (elementwise ops, reductions, matrix operations, attention variants, etc.). These were paired with reference implementations and profiling results, creating a curriculum that spans easy single-kernel tasks (Level 1) up to complex multi-kernel sequences (Level 3).

The synthesis pipeline is important: they didn't just use existing open-source kernels. They generated diverse operator variants to ensure the training distribution covers the kinds of operations the model would need to optimize at test time. This is the unsexy but critical foundation everything else rests on.

2. Skill-Augmented Execution Environment (ReAct + GPU Sandbox)

The agent doesn't just output code once and hope. It operates in a ReAct loop — Reason, Act, Observe — with access to a set of tools in a real GPU execution environment:

Compile: pass the kernel through nvcc, get errors back
Profile: run the kernel with ncu (Nsight Compute), get roofline analysis and bottleneck identification
Correctness check: compare against a reference implementation on 5 random inputs
Iterate: rewrite based on the profiling feedback

The "skill augmentation" means the model also has access to structured CUDA optimization knowledge — essentially a curated reference on tiling strategies, memory access patterns, and optimization techniques — that it can consult during the reasoning steps.

This is genuinely novel. Most prior work gives the model a fixed number of attempts with black-box feedback ("correct/incorrect", "faster/slower"). CUDA Agent gives the model real profiling signals that a human expert would use.

3. RL Training Pipeline (Staged)

The training is where the paper gets interesting. They use a staged approach:

Single-turn PPO warm-up: Start with standard PPO on single-turn kernel generation to get a baseline model that can at least produce valid CUDA.
Rejection Fine-Tuning (RFT): Filter successful rollouts and fine-tune on them, building up a portfolio of working optimizations.
Critic Pre-training: Train a value function that can estimate expected speedup, which is needed for stable multi-turn RL.
Multi-turn RL: Finally, the full agentic loop — multi-step reasoning and tool use, with the critic providing reward shaping.

The staged approach matters because multi-turn RL with sparse rewards is notoriously unstable. By the time you get to the full agentic loop, the model already has a strong foundation.

The Results

On KernelBench:

Level 1 (single ops): 100% faster than torch.compile
Level 2 (fused ops): 100% faster than torch.compile
Level 3 (complex sequences): 92% faster than torch.compile
Overall: 2.11x geometric mean speedup

Against strong proprietary models on Level-3 (the hard benchmark): CUDA Agent scores ~40 points higher than Claude 3.5 Sonnet and Gemini 1.5 Pro with standard prompting approaches.

My Skepticism: The `torch.compile` Comparison Is Unfair

Here's where I push back.

torch.compile is a general-purpose optimizing compiler. It's designed to work across:

Multiple GPU architectures (H100, A100, RTX 4090, consumer GPUs)
Arbitrary computational graphs, not just common operators
Training and inference, with dynamic shapes
Without any GPU-specific tuning

CUDA Agent's generated kernels are optimized for a single specific GPU (the H100 they test on). A kernel optimized for H100's Hopper architecture will be slower on A100's Ampere architecture. The agent has no concept of portability.

Comparing a single-GPU-specialized hand-tuned kernel to a general-purpose compiler and claiming 2.11x speedup is like comparing a professional race car driver's lap time to a commuter car's cruise control. Of course the specialized thing is faster. That's the entire point of specialization.

A fairer comparison would be:

torch.compile with a max-autotune profile
CuTe-based hand-written kernels from experts
Triton kernels (also GPU-specific, but typically hand-written)

On correctness validation: The paper checks correctness with 5 random inputs. For many ops this is fine. But numerical precision is full of corner cases — denormals, NaN/Inf propagation, accumulation order in reductions, half-precision saturation. Five random draws from a standard distribution won't cover these. A more rigorous correctness evaluation matters for production use.

Why the Core Conclusion Still Holds

Despite my complaints about the benchmark comparison, I think the paper's central claim is valid: on narrow, well-defined tasks, RL can train an LLM to develop genuine, transferable optimization ability.

The evidence that's most convincing to me is the Level-3 results against proprietary models. Claude and Gemini are strong at following instructions and generating code, but they haven't been specifically trained to understand GPU profiling signals and iterate on optimization. The ~40 point gap on Level-3 isn't about one model being smarter — it's about one model having internalized a specific skill through training.

The staged training pipeline is also a real contribution. Multi-turn RL with tool use is hard to stabilize; the PPO warm-up → RFT → Critic pre-training → multi-turn sequence is a concrete recipe that works, and that's valuable information for anyone trying to apply similar techniques to other domains.

What would convince me it fully generalizes:

Results on multiple GPU architectures (not just H100)
Correctness evaluation with adversarial inputs (NaN/Inf, denormals, maximum values)
Comparison to Triton kernel generation approaches
Whether the optimization strategies transfer to new operator types not in the training set

The Bigger Picture

There's been a lot of excitement about "LLMs writing CUDA" as if it's a solved problem. It's not. But CUDA Agent is a meaningful step that shows what it looks like to take the problem seriously: proper data synthesis, a real execution environment with real profiling, and a training pipeline designed for the specifics of the task.

The result is a model that has learned something real: how to read a profiler, identify a bottleneck, and apply the right optimization technique. That's not nothing. That's actually pretty impressive for a system that started from language model pretraining.

I just don't think 2.11x over torch.compile is the right framing for why it's impressive.