Diffusion models generate stunning images, but they're slow. A single image requires dozens of sequential neural network evaluations — each one a full forward pass through a U-Net. DPM-Solver++ brought that down to 10-20 steps with reasonable quality, and it's the current state of the art. But what if we could do better by borrowing techniques that the scientific computing community has used for decades?
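To make the scientific-computing connection concrete, here is a minimal sketch (my own, not from the post) of why higher-order ODE solvers need fewer steps. Diffusion sampling can be cast as integrating a probability-flow ODE where every derivative evaluation is one full network forward pass; DPM-Solver++ is an exponential integrator rather than Heun's method, but the principle is the same: higher order buys more accuracy per evaluation. The toy ODE below stands in for the real sampler.

```python
import math

# Toy ODE dx/dt = -x with x(0) = 1; exact solution x(T) = exp(-T).
# In diffusion sampling, each call to f would be a full U-Net forward pass,
# so fewer solver steps directly means faster image generation.
def f(x):
    return -x

def euler(x0, T, n):
    # First-order method: global error shrinks linearly in the step count.
    h, x = T / n, x0
    for _ in range(n):
        x += h * f(x)
    return x

def heun(x0, T, n):
    # Second-order method (Heun): global error shrinks quadratically,
    # so far fewer steps reach the same accuracy.
    h, x = T / n, x0
    for _ in range(n):
        k1 = f(x)
        k2 = f(x + h * k1)
        x += h * (k1 + k2) / 2
    return x

exact = math.exp(-2.0)
print(abs(euler(1.0, 2.0, 10) - exact))  # first-order error at 10 steps
print(abs(heun(1.0, 2.0, 10) - exact))   # noticeably smaller at the same 10 steps
```

Heun spends two derivative evaluations per step where Euler spends one, yet still wins on accuracy per evaluation; fast diffusion samplers push the same trade further with higher-order, problem-specific integrators.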
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
ByteDance Seed + Tsinghua AIR (SIA-Lab), 2026
cuda-agent.github.io
Writing fast GPU kernels is genuinely hard. You need to understand memory hierarchy, warp scheduling, bank conflicts, tensor core layouts, and about fifty other microarchitectural details that change between GPU generations. Most engineers — including most ML engineers — don't have this knowledge. They use libraries (cuBLAS, cuDNN, FlashAttention) and hope for the best.
