[Paper Notes] CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
TL;DR
LLMs are decent at general coding but struggle to beat torch.compile at writing optimized CUDA kernels. CUDA Agent closes the gap with large-scale agentic reinforcement learning: a scalable data pipeline (6K synthesized operator tasks), a skill-augmented CUDA development environment with anti-reward-hacking safeguards, and multi-stage warm-up (RFT + Value Pretraining) that stabilizes 150-step RL training. Result: 97%, 100%, and 90% faster rates over torch.compile on KernelBench Level-1/2/3 – beating Claude Opus 4.5 and Gemini 3 Pro by ~40 points on the hardest split.
Paper Info
- Title: CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
- Authors: Weinan Dai*, Hanlin Wu*, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, Hao Zhou
- Affiliations: ByteDance Seed, Institute for AI Industry Research (AIR) Tsinghua University, SIA-Lab
- arXiv: 2602.24286
- Date: March 2, 2026
- Paper type: LLM agent training / CUDA kernel optimization / reinforcement learning
1. Problem and Motivation
GPU kernel optimization is critical for deep learning performance but requires deep hardware expertise. Despite LLMs excelling at general programming, they remain uncompetitive with compiler-based systems like torch.compile for CUDA kernel generation.
Existing approaches fall into two camps, both with fundamental limits:
- Training-free refinement (STARK, ReGraphT, EvoEngineer): hand-designed heuristics with execution feedback. Performance is capped by the base model’s intrinsic CUDA ability.
- Fine-tuning within fixed loops (Kevin, CUDA-L1, ConCuR): multi-turn execution-feedback pipelines. These waste context on all previous solutions and constrain the agent’s autonomy to learn its own debugging/profiling strategies.
Neither paradigm fundamentally improves the model’s intrinsic CUDA optimization capability.
2. Method
CUDA Agent has three pillars: data, environment, and RL techniques.
2.1 Scalable Data Synthesis Pipeline
High-quality CUDA training data is scarce. The pipeline:
- Seed Problem Crawling: mine reference operators from the `torch` and `transformers` libraries
- Combinatorial Synthesis: LLMs sample up to 5 operator classes and compose them into fused tasks – because fused problems require joint optimization (shared registers/SMEM/occupancy), not just chaining individually optimized ops
- Rubric-based Filtering: keep only executable, deterministic, non-trivial problems with 1-100 ms eager runtime; exclude KernelBench-similar cases
This yields CUDA-Agent-Ops-6K, a curated operator-level dataset.
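The rubric-based filter can be sketched as a simple predicate. This is an illustrative CPU stand-in (the real pipeline times eager GPU execution and applies richer criteria such as KernelBench de-duplication); `problem_fn` and the helper structure are assumptions:

```python
import time

def passes_rubric(problem_fn, inputs, n_runs=3, lo_ms=1.0, hi_ms=100.0):
    """Illustrative rubric check: executable, deterministic, 1-100 ms runtime."""
    # Executable: the candidate task must run without raising.
    try:
        outputs = [problem_fn(*inputs) for _ in range(n_runs)]
    except Exception:
        return False
    # Deterministic: repeated runs must agree exactly.
    if any(o != outputs[0] for o in outputs[1:]):
        return False
    # Runtime band: wall-clock stand-in for the 1-100 ms eager-mode window.
    t0 = time.perf_counter()
    problem_fn(*inputs)
    elapsed_ms = (time.perf_counter() - t0) * 1000.0
    return lo_ms <= elapsed_ms <= hi_ms
```

The band matters on both ends: sub-millisecond tasks are launch-overhead-dominated, and very slow ones waste sandbox time during RL rollouts.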
2.2 Skill-Integrated Agent Loop
The agent loop follows a ReAct-style paradigm (reason → act → observe) compatible with OpenHands, using standard tools (BashTool, GlobTool, MultiEditTool, TodoWriteTool). On top of this, CUDA Agent gets a SKILL.md that formalizes the CUDA optimization workflow:
- Profile the native PyTorch implementation to find bottlenecks
- Implement custom CUDA operators targeting identified hotspots
- Compile and evaluate in a GPU sandbox, iterate until correct
- Repeat until ≥5% speedup over `torch.compile`
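The workflow above can be sketched as a loop. All helper names here (`sandbox.profile_eager`, `agent.revise`, etc.) are hypothetical stand-ins for the agent's actual tool calls, not the paper's API:

```python
def optimize_kernel(task, agent, sandbox, max_turns=150, target=1.05):
    """Sketch of the SKILL.md workflow: profile, implement, evaluate, iterate."""
    profile = sandbox.profile_eager(task)              # 1. find bottlenecks
    solution = agent.draft_cuda_kernel(task, profile)  # 2. target hotspots
    for _ in range(max_turns):
        result = sandbox.compile_and_run(task, solution)  # 3. GPU sandbox
        if result.correct and result.speedup_vs_compile >= target:
            return solution                            # 4. >=5% over torch.compile
        solution = agent.revise(solution, result.feedback)
    return None  # turn budget exhausted
```

The point of the agent loop is that the revise step is not a fixed template: the policy learns its own debugging and profiling strategies within the turn budget.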
Anti-reward-hacking measures (this is a highlight):
- Evaluation scripts are file-permission protected – the agent can’t modify them
- Context managers forbid fallback to `torch.nn.functional`
- 5 random inputs for correctness validation
- Proper GPU synchronization, warm-up iterations, and averaged measurements
- No web search tools – all solutions from local environment only
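The random-input correctness gate might look like this minimal sketch. The real harness compares GPU tensors with proper synchronization and averaged timing; the tolerance scheme and helper names here are assumptions:

```python
import random

def correctness_check(candidate, reference, make_random_input,
                      n_trials=5, tol=1e-4):
    """Validate a candidate kernel against the reference on random inputs."""
    for _ in range(n_trials):
        x = make_random_input()
        got, want = candidate(x), reference(x)
        # Relative tolerance with an absolute floor, so near-zero outputs
        # are not held to an impossibly tight bound.
        if abs(got - want) > tol * max(1.0, abs(want)):
            return False
    return True
```

Randomizing inputs on every evaluation is itself an anti-hacking measure: the agent cannot hard-code outputs for a fixed test set.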
2.3 Robust Reward Scheduling
Instead of raw speedup (noisy, biased toward easy kernels), they use a discrete milestone-based reward:
\[
r = \begin{cases} -1 & \text{if correctness check fails} \\ 3 & \text{if faster than both eager and compile} \\ 2 & \text{if faster than eager only} \\ 1 & \text{otherwise (correct but not faster)} \end{cases}
\]

where "faster" means >5% speedup. This normalized reward avoids outlier-driven optimization.
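The milestone reward translates directly into code; treating ">5% speedup" as a strict ratio threshold on measured runtimes is the only assumption:

```python
def milestone_reward(correct, t_kernel, t_eager, t_compile, margin=1.05):
    """Discrete milestone reward: 'faster' means >5% speedup (ratio > margin)."""
    if not correct:
        return -1
    beats_eager = t_eager / t_kernel > margin
    beats_compile = t_compile / t_kernel > margin
    if beats_eager and beats_compile:
        return 3
    if beats_eager:
        return 2
    return 1  # correct but not (sufficiently) faster
```

Because the reward is bounded in {-1, 1, 2, 3} regardless of the raw speedup, a single easy kernel with a 50x gain cannot dominate the gradient signal the way it would under a continuous speedup reward.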
2.4 Multi-Stage Warm-up for Stable Training
The core training instability comes from domain distribution mismatch – CUDA code is <0.01% of pretraining data, causing low-probability tokens and exploding importance sampling ratios.
The fix is a multi-stage warm-up:
- Single-Turn RL: standard PPO on the base model (Seed1.6, 23B active / 230B total MoE) for basic CUDA generation ability
- Actor Initialization (RFT): collect agent trajectories from the single-turn model, reject-sample for high-quality ones (positive reward, no hallucinated tool calls), fine-tune
- Critic Initialization (Value Pretraining): pretrain the critic on trajectory data with GAE targets so it can immediately provide useful advantage estimates
- Agentic RL: full PPO with 128K context, up to 150 agent turns during training (200 at eval)
Without RFT, entropy explodes and the policy collapses within ~17 steps; without Value Pretraining, the critic cannot estimate values and trajectory lengths explode.
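Value Pretraining regresses the critic onto GAE targets computed from the collected trajectories, so that the critic's first advantage estimates in agentic RL are already calibrated. A minimal sketch of the target computation (the gamma/lambda values are illustrative, not the paper's):

```python
def gae_targets(rewards, values, gamma=1.0, lam=0.95):
    """Compute GAE advantages and value-regression targets for one trajectory.

    `values` has len(rewards) + 1 entries (bootstrap value appended at the end).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual at step t, then the exponentially weighted backup.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    # Critic regression targets: advantage + current value estimate.
    targets = [a + v for a, v in zip(advantages, values)]
    return advantages, targets
```

Pretraining the critic on these targets before full PPO means the actor's first on-policy updates are driven by meaningful advantages rather than the noise of a randomly initialized value head.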
3. Experiments and Main Results
Benchmark: KernelBench (250 operator tasks across Level 1-3).
Setup: Seed1.6 base model, batch size 1024, 128 H20 GPUs for profiling, Docker-based sandboxes.
Key Results (vs. torch.compile)
| Model | L1 Faster Rate | L2 Faster Rate | L3 Faster Rate | L3 Speedup |
|---|---|---|---|---|
| GLM 4.6 | 32% | 11% | 10% | 0.62x |
| Kimi K2 | 39% | 15% | 6% | 0.29x |
| Gemini 3 Pro | 72% | 76% | 52% | 1.17x |
| Claude Opus 4.5 | 72% | 69% | 50% | 1.10x |
| CUDA Agent | 97% | 100% | 90% | 1.52x |
Three takeaways:
- CUDA Agent massively outperforms proprietary models, especially on harder tasks (a ~40-point gap in faster rate on L3)
- Level 2 (operator sequences / fusion) shows the biggest win: 100% faster rate, 2.80x speedup. Compiler heuristics can’t handle non-trivial fusion patterns; the agent explores a much larger design space
- Learned optimization policies can consistently beat static compiler heuristics
Ablation Highlights
| Variant | Overall Faster Rate (vs. Compile) | Overall Speedup (vs. Compile) |
|---|---|---|
| w/o Agent Loop | 14.1% | 0.69x |
| w/o Robust Reward | 60.4% | 1.25x |
| w/o RFT | 49.8% | 1.05x |
| w/o Value Pretraining | 50.9% | 1.00x |
| Full CUDA Agent | 96.8% | 2.11x |
Every component matters, but the agent loop is the biggest contributor – without it, the model can barely beat torch.compile at all.
4. Strengths and Limitations
Strengths:
- Comprehensive anti-reward-hacking design (file permissions, fallback prevention, measurement rigor)
- The multi-stage warm-up is well-motivated and cleanly ablated – each stage addresses a specific instability mode
- The combinatorial data synthesis is clever: fused operator tasks create genuinely novel optimization challenges
- SOTA results with large margins, especially on the harder splits
Limitations:
- Built on Seed1.6 (230B MoE) – unclear how much transfers to smaller models
- KernelBench is the only benchmark; real-world kernel optimization has additional complexities (library integration, memory management across kernel boundaries)
- 128 H20 GPUs for the sandbox pool is a significant resource requirement
- The paper notes ChatGPT-5 series models “declined to respond to CUDA-related prompts” – an interesting but unelaborated observation
5. Takeaways
- Agentic RL > fixed pipelines for code optimization tasks. Letting the model learn its own debugging and profiling strategies (via agent loop) is far more effective than constraining it to pre-designed multi-turn templates.
- Domain-specific warm-up is critical when the target domain is a tiny fraction of pretraining data. The RFT → Value Pretraining → PPO pipeline is a reusable recipe for other low-resource agentic RL domains.
- Discrete milestone rewards > continuous speedup rewards for optimization tasks with noisy measurements. The robust reward design avoids outlier-driven training while still incentivizing genuine performance gains.
- LLM-based kernel generation is now competitive with (and often superior to) compiler-driven optimization. This is a significant milestone – it suggests a path where AI agents handle performance-critical low-level optimization that currently requires deep hardware expertise.
