inference-server · 5,762 stars · Apache 2.0

flashinfer

FlashInfer: Kernel Library for LLM Serving

Updated Jun 8, 2026
Platforms: web
Pricing: free-open-source
Status: active
License: Apache 2.0

What it does

Core capabilities at a glance

  • Attention
  • CUDA
  • Distributed Inference
  • GPU
  • JIT
  • Large Language Models
  • LLM Inference
  • MoE

Deep dive

The full breakdown - performance, comparisons, and setup

flashinfer

flashinfer is listed in this directory under local inference servers. The project describes itself as "FlashInfer: Kernel Library for LLM Serving".

Overview

FlashInfer is a library and kernel generator for inference that delivers state-of-the-art performance across diverse GPU architectures. It provides unified APIs for attention, GEMM, and MoE operations with multiple backend implementations including FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM.

  • State-of-the-art Performance: Optimized kernels for prefill, decode, and mixed batching scenarios
  • Multiple Backends: Automatically selects the best backend for your hardware and workload
  • Modern Architecture Support: Support for SM75 (Turing) and later (through Blackwell)
  • Low-Precision Compute: FP8 and FP4 quantization for attention, GEMM, and MoE operations
  • Production-Ready: CUDAGraph and torch.compile compatible for low-latency serving

  • BF16 GEMM: BF16 matrix multiplication for SM10.0+ GPUs
  • FP8 GEMM: Per-tensor and groupwise scaling
  • FP4 GEMM: NVFP4 and MXFP4 matrix multiplication for Blackwell GPUs
  • Grouped GEMM: Efficient batched matrix operations for LoRA and multi-expert routing
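The difference between per-tensor and groupwise scaling can be illustrated with a small pure-Python sketch. This is not FlashInfer's API: the function names and the group size are hypothetical, and the only outside fact used is that the FP8 E4M3 format has a largest finite value of 448. The sketch computes scale factors only and does not simulate FP8 rounding.

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def per_tensor_scale(values):
    """One scale for all values: the largest magnitude maps to E4M3_MAX."""
    amax = max(abs(v) for v in values)
    return amax / E4M3_MAX if amax > 0 else 1.0


def groupwise_scales(values, group_size):
    """One scale per contiguous group of `group_size` values (hypothetical layout)."""
    return [
        per_tensor_scale(values[i:i + group_size])
        for i in range(0, len(values), group_size)
    ]


row = [0.5, -1.0, 0.25, 2.0, 100.0, -448.0, 3.0, 7.0]
tensor_scale = per_tensor_scale(row)
group_scales = groupwise_scales(row, group_size=4)
print("per-tensor scale:", tensor_scale)
print("groupwise scales:", group_scales)
```

With a single scale, the outlier in the second half of the row sets the scale for every element. With one scale per group, the first four small values get a much smaller scale and therefore keep more of their resolution after quantization.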

  • RoPE: LLaMA-style rotary position embeddings (including LLaMA 3.1)
  • Normalization: RMSNorm, LayerNorm, Gemma-style fused operations
  • Activations: SiLU, GELU with fused gating
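For readers unfamiliar with the operators named above, here is a plain-Python reference of what RMSNorm and a SiLU-gated activation compute. It is a sketch of the math, not FlashInfer's kernels or API. The function names are hypothetical, and the gated activation assumes the common convention that the first half of the input is the gate and the second half is the value.

```python
import math


def rms_norm_reference(x, weight, eps=1e-6):
    """RMSNorm: divide by the root mean square of x, then scale by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def gated_silu_reference(x):
    """Fused gating: silu(gate) * value, with gate = first half, value = second half.

    The half-and-half layout is an assumption made for this sketch.
    """
    d = len(x) // 2
    return [silu(g) * u for g, u in zip(x[:d], x[d:])]


normed = rms_norm_reference([3.0, 4.0], weight=[1.0, 1.0], eps=0.0)
gated = gated_silu_reference([0.0, 1.0, 5.0, 2.0])
print(normed)
print(gated)
```

A fused kernel produces the same values in a single pass over memory instead of materializing the intermediate `silu(gate)` tensor, which is where the performance benefit comes from.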

Notable updates:

  • [2025-10-08] Blackwell support added in v0.4.0
  • [2025-03-10] Blog post "Sorting-Free GPU Kernels for LLM Sampling", which explains the design of sampling kernels in FlashInfer

See the project documentation for a comprehensive API reference and tutorials.

FlashInfer provides comprehensive API logging for debugging, enabled through environment variables. The project documentation lists the variable names and log levels.
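As a rough sketch of what enabling the logging from Python might look like: the variable names `FLASHINFER_LOGLEVEL` and `FLASHINFER_LOGDEST` below are assumptions recalled from the project README and do not appear on this page, so verify them against the current documentation before relying on them. The helper name `logging_env` is hypothetical.

```python
import os


def logging_env(level=3, dest="stdout"):
    """Build the environment settings for API logging.

    Both variable names are assumptions; check the FlashInfer docs.
    """
    return {
        "FLASHINFER_LOGLEVEL": str(level),  # assumed name: verbosity level
        "FLASHINFER_LOGDEST": dest,         # assumed name: stdout, stderr, or a file path
    }


settings = logging_env()
os.environ.update(settings)
print(settings)

# Import the library only after the environment is configured, on the
# assumption that the settings are read when the module is loaded:
# import flashinfer
```

Setting the same variables with `export` in the shell before starting the serving process would be the equivalent approach.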

flashinfer is open source, written primarily in Python, with 5,762 GitHub stars under the Apache 2.0 license. The latest release is v0.6.12 (2026-05-29).

Key capabilities

From the project's documentation:

  • State-of-the-art Performance: Optimized kernels for prefill, decode, and mixed batching scenarios
  • Multiple Backends: Automatically selects the best backend for your hardware and workload
  • Modern Architecture Support: Support for SM75 (Turing) and later (through Blackwell)
  • Low-Precision Compute: FP8 and FP4 quantization for attention, GEMM, and MoE operations
  • Production-Ready: CUDAGraph and torch.compile compatible for low-latency serving
  • Paged and Ragged KV-Cache: Efficient memory management for dynamic batch serving
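To make the paged KV-cache item concrete, the sketch below shows one common way to describe such a cache: a CSR-style page table in which each request owns a list of fixed-size physical pages, and only the last page may be partially filled. The names `indptr`, `indices`, and `last_page_len` are chosen for illustration and the functions are hypothetical; consult the FlashInfer documentation for the exact tensors its wrappers expect.

```python
def sequence_lengths(indptr, last_page_len, page_size):
    """Tokens stored per request: full pages plus the partially filled last page."""
    lengths = []
    for i in range(len(indptr) - 1):
        num_pages = indptr[i + 1] - indptr[i]
        if num_pages == 0:
            lengths.append(0)
        else:
            lengths.append((num_pages - 1) * page_size + last_page_len[i])
    return lengths


def locate_token(request, position, indptr, indices, page_size):
    """Map (request, token position) to (physical page id, slot within the page)."""
    page_offset, slot = divmod(position, page_size)
    start, end = indptr[request], indptr[request + 1]
    if start + page_offset >= end:
        raise IndexError("position is beyond the pages owned by this request")
    return indices[start + page_offset], slot


page_size = 4
indptr = [0, 2, 5]         # request 0 owns indices[0:2], request 1 owns indices[2:5]
indices = [7, 3, 0, 9, 4]  # physical page ids, in logical order per request
last_page_len = [1, 4]     # valid tokens in each request's final page

lengths = sequence_lengths(indptr, last_page_len, page_size)
print(lengths)
print(locate_token(0, 4, indptr, indices, page_size))
print(locate_token(1, 6, indptr, indices, page_size))
```

Because pages need not be contiguous, a request can grow by appending any free page to its list, which is what makes this layout suited to dynamic batches.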

Install

A quick way to get started (always check the official docs for the latest):

pip install flashinfer-python
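After installing, a short script can confirm that the package is importable and that the GPU meets the SM75 minimum quoted above. This is a sketch under stated assumptions: the helper names are hypothetical, and the GPU query runs only if PyTorch is installed and a CUDA device is visible.

```python
import importlib.util

MIN_SM = (7, 5)  # SM75 (Turing), the minimum architecture quoted on this page


def is_installed(module_name="flashinfer"):
    """Return True if the module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def meets_min_sm(capability, minimum=MIN_SM):
    """Compare a (major, minor) compute capability against the minimum."""
    return tuple(capability) >= tuple(minimum)


print("flashinfer importable:", is_installed())

# Optional GPU check, skipped when PyTorch or a CUDA device is unavailable.
if is_installed("torch"):
    import torch

    if torch.cuda.is_available():
        capability = torch.cuda.get_device_capability()
        print("compute capability:", capability, "supported:", meets_min_sm(capability))
```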

How it fits a local-AI stack

flashinfer runs on your own hardware, so pair it with a model and a GPU sized to your needs. Use the VRAM calculator to pick a model that fits your card, and see what you can run for hardware guidance.

Sources

Stats from GitHub, 2026-06-08.

Frequently asked

Quick answers to common questions

What is flashinfer?

flashinfer is listed in this directory as an inference-server tool for local AI workloads. The project describes itself as "FlashInfer: Kernel Library for LLM Serving".

Is flashinfer free and open source?

Yes. flashinfer is licensed under Apache 2.0 and is free to self-host. Its GitHub repository has 5,762 stars.

What platforms does flashinfer support?

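The directory lists flashinfer's platform as web. As a GPU kernel library, it targets GPU architectures from SM75 (Turing) through Blackwell.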

What hardware do I need for flashinfer?

flashinfer supports GPU architectures from SM75 (Turing) through Blackwell. Beyond that, the hardware requirements depend on which models you run. Check our hardware directory for compatible GPUs and systems.

Does flashinfer support GPU acceleration?

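Yes. flashinfer is a GPU kernel library that supports SM75 (Turing) and later architectures, through Blackwell. Some features are architecture-specific, such as FP4 GEMM on Blackwell GPUs, so check the documentation for details. The directory suggests pairing it with an NVIDIA RTX 4090 or 5090.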

What are the best alternatives to flashinfer?

Popular alternatives include other inference-server tools in our directory. Browse our full collection at /tool for comparisons, community reviews, and benchmark data to find the right fit for your workflow.

How much does flashinfer cost?

flashinfer is free and open source under the Apache 2.0 license, and there is no cost to self-host it.
