Diffusion models generate stunning images, but they're slow. A single image requires dozens of sequential neural network evaluations — each one a full forward pass through a U-Net. DPM-Solver++ brought that down to 10-20 steps with reasonable quality, and it's the current state of the art. But what if we could do better by borrowing techniques that the scientific computing community has used for decades?
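To make the scientific-computing connection concrete, here is a minimal sketch (my own, not from the post) of why higher-order ODE solvers need fewer steps. Diffusion sampling can be cast as integrating a probability-flow ODE where every derivative evaluation is one full network forward pass; DPM-Solver++ is an exponential integrator rather than Heun's method, but the principle is the same: higher order buys more accuracy per evaluation. The toy ODE below stands in for the real sampler.

```python
import math

# Toy ODE dx/dt = -x with x(0) = 1; exact solution x(T) = exp(-T).
# In diffusion sampling, each call to f would be a full U-Net forward pass,
# so fewer solver steps directly means faster image generation.
def f(x):
    return -x

def euler(x0, T, n):
    # First-order method: global error shrinks linearly in the step count.
    h, x = T / n, x0
    for _ in range(n):
        x += h * f(x)
    return x

def heun(x0, T, n):
    # Second-order method (Heun): global error shrinks quadratically,
    # so far fewer steps reach the same accuracy.
    h, x = T / n, x0
    for _ in range(n):
        k1 = f(x)
        k2 = f(x + h * k1)
        x += h * (k1 + k2) / 2
    return x

exact = math.exp(-2.0)
print(abs(euler(1.0, 2.0, 10) - exact))  # first-order error at 10 steps
print(abs(heun(1.0, 2.0, 10) - exact))   # noticeably smaller at the same 10 steps
```

Heun spends two derivative evaluations per step where Euler spends one, yet still wins on accuracy per evaluation; fast diffusion samplers push the same trade further with higher-order, problem-specific integrators.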
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
ByteDance Seed + Tsinghua AIR (SIA-Lab), 2026
cuda-agent.github.io
Writing fast GPU kernels is genuinely hard. You need to understand memory hierarchy, warp scheduling, bank conflicts, tensor core layouts, and about fifty other microarchitectural details that change between GPU generations. Most engineers — including most ML engineers — don't have this knowledge. They use libraries (cuBLAS, cuDNN, FlashAttention) and hope for the best.
