Moonshot AI open-sources FlashKDA attention kernels with up to 2.22x prefill speedup

The CUTLASS-based kernel for Kimi Delta Attention drops in as a flash-linear-attention backend, hitting 1.72–2.22x prefill speedup on Nvidia H20 GPUs.

Moonshot AI released FlashKDA on April 30, an open-source CUTLASS-based CUDA kernel for the Kimi Delta Attention mechanism that powers its Kimi-K2.6 model. Published on GitHub under the MIT license, the kernel delivers a reported 1.72x to 2.22x prefill speedup over the flash-linear-attention baseline on Nvidia H20 GPUs, with the peak appearing in variable-length batched workloads, the realistic case for inference serving.

The integration story is the news. FlashKDA auto-dispatches from flash-linear-attention's existing chunk_kda interface, so codebases already using flash-linear-attention pick up the speedup with no manual rewiring. It targets Hopper-class hardware (H100, H20 and above), requires CUDA 12.9+ and PyTorch 2.4+, is fixed at head dimension 128, and natively supports cu_seqlens-style packed batching.
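For readers unfamiliar with cu_seqlens-style packing, the idea is to concatenate sequences of different lengths into one long token stream and pass a list of cumulative offsets that marks where each sequence starts and ends, a convention also used by FlashAttention's variable-length interface. The sketch below shows only that bookkeeping in plain Python. The helper names build_cu_seqlens and split_packed are hypothetical, and the exact tensor types and arguments FlashKDA expects are not described in this article.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Return cumulative offsets [0, l0, l0+l1, ...] for packed sequences.

    Hypothetical helper for illustration; not part of FlashKDA's API.
    """
    if any(n <= 0 for n in seq_lens):
        raise ValueError("sequence lengths must be positive")
    return [0, *accumulate(seq_lens)]


def split_packed(tokens, cu_seqlens):
    """Recover the per-sequence slices from a packed token list."""
    return [tokens[start:end] for start, end in zip(cu_seqlens, cu_seqlens[1:])]


# Three requests of different lengths packed into one stream, with no padding.
seq_lens = [3, 5, 2]
cu_seqlens = build_cu_seqlens(seq_lens)
packed_tokens = list(range(sum(seq_lens)))

print(cu_seqlens)                               # [0, 3, 8, 10]
print(split_packed(packed_tokens, cu_seqlens))  # one slice per request
```

Packing this way avoids spending compute on padding tokens, which is why variable-length batches are the case that matters for serving.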

Linear-attention variants like KDA matter for one reason: long context costs less. Standard attention scales quadratically with sequence length; delta and linear attention trade some expressivity for near-linear cost. By open-sourcing the production kernel, not just the architecture paper, Moonshot is doing what DeepSeek did with its inference stack: removing the implementation gap that usually keeps academic attention variants out of real systems. Open-source kernels that drop into a popular framework are how an architecture actually gets adopted.
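To put rough numbers on the quadratic-versus-linear claim, the sketch below compares back-of-envelope multiply-add counts per attention head. It assumes a generic softmax attention (two n-by-n-by-d products) and a generic linear-attention recurrence that updates and reads a d-by-d state once per token. It ignores constants, gating, and chunking, so it is an illustration of scaling, not a model of KDA or of FlashKDA's measured speedups. The function names are hypothetical.

```python
def softmax_attention_cost(n, d):
    """Approximate multiply-adds: QK^T scores plus the weighted sum over V."""
    return 2 * n * n * d


def linear_attention_cost(n, d):
    """Approximate multiply-adds: per token, update and read a d x d state."""
    return 2 * n * d * d


head_dim = 128  # the head dimension FlashKDA is fixed at, per the article
for seq_len in (4_096, 32_768, 262_144):
    ratio = softmax_attention_cost(seq_len, head_dim) / linear_attention_cost(seq_len, head_dim)
    print(f"n={seq_len:>7}: softmax / linear cost ratio = {ratio:.0f}x")
```

Under these assumptions the ratio is simply n / d, so the gap widens in direct proportion to context length: 32x at 4,096 tokens and 2,048x at 262,144 tokens for a head dimension of 128.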

Takeaway for learners: model architecture papers rarely change the world on their own. What changes the world is when someone ships the kernels that make the architecture practical on commodity hardware. If you want to understand which efficiency techniques will actually shape AI in the next year, watch the GitHub repos of the labs releasing kernels and serving stacks, not just model weights.