candle-vllm social preview
inference-server671MIT

candle-vllm

Efficent platform for inference and serving local LLMs including an OpenAI compatible API server.

Updated Jun 8, 2026
Platforms
macos, docker
Pricing
free-open-source
Status
active
License
MIT

Deep dive

The full breakdown - performance, comparisons, and setup

candle-vllm

candle-vllm is a local inference server - Efficent platform for inference and serving local LLMs including an OpenAI compatible API server.

Overview

Efficient, easy-to-use platform for inference and serving local LLMs including an OpenAI compatible API server.

  • OpenAI compatible API server provided for serving LLMs. - Highly extensible trait-based system to allow rapid implementation of new module pipelines, - Streaming support in generation. - Efficient management of key-value cache with PagedAttention. - Continuous batching (batched decoding for incoming requests over time). - 'In-situ' quantization (and 'In-situ' marlin format conversion) - 'GPTQ/Marlin' format quantization (4-bit) - Support 'Mac/Metal' devices - Support 'Multi-GPU' inference (both 'multi-process' and 'multi-threaded' mode) - Support 'Multi-node' inference with MPI runner - Support Chunked Prefilling (default chunk size 8K) - Support CUDA Graph - Support Model Context Protocol (MCP) and OpenAI-compatible tool calling - Support Prefix Caching - Support Block-wise FP8 Models (SM90+, Qwen3 Series) - Support FP8 KV Cache on all CUDA and Metal platforms - Support TurboQuant KV Cache (turbo8/turbo4/turbo3) with native flash attention kernels - Support Flashinfer Backend - Support manual YaRN RoPE scaling override from the command line via '--yarn-scaling-factor' - Support MXFP4/NVFP4 models

  • Currently, candle-vllm supports chat serving for the following model structures.

'PROGRAM_PARAM':--log --dtype bf16 --p 2000 --d 0,1 --gpu-memory-fraction 0.5 --isq q4k --prefill-chunk-size 8192 --frequency-penalty 1.1 --presence-penalty 1.1 --enforce-parser qwen_coder --yarn-scaling-factor 4.0

'MODEL_ID/MODEL_WEIGHT_PATH': --m Qwen/Qwen3.6-27B-FP8 (or '--w' specify local model path)

candle-vllm is open-source, written primarily in Rust, with 671 GitHub stars under the MIT license. It was last updated on 2026-05-26.

Key capabilities

From the project's documentation:

  • OpenAI compatible API server provided for serving LLMs.
  • Highly extensible trait-based system to allow rapid implementation of new module pipelines,
  • Streaming support in generation.
  • Efficient management of key-value cache with PagedAttention.
  • Continuous batching (batched decoding for incoming requests over time).
  • In-situ quantization (and In-situ marlin format conversion)

Install

A quick way to get started (always check the official docs for the latest):

go install --features cuda,nccl,flashinfer,cutlass --path .

How it fits a local-AI stack

candle-vllm runs on your own hardware, so pair it with a model and a GPU sized to your needs. Use the VRAM calculator to pick a model that fits your card, and see what you can run for hardware guidance. Related local inference servers in the directory:

Sources

Stats from GitHub, 2026-06-08.

Frequently asked

Quick answers to common questions

What is candle-vllm?

candle-vllm is a inference-server tool for local AI workloads. Efficent platform for inference and serving local LLMs including an OpenAI compatible API server.

Is candle-vllm free and open source?

Yes, candle-vllm has 671 GitHub stars and is licensed under MIT. You can self-host it for free on macos, docker.

What platforms does candle-vllm support?

candle-vllm runs on macos, docker.

What hardware do I need for candle-vllm?

The hardware requirements depend on which models you run. Check our hardware directory for compatible GPUs and systems. candle-vllm has 671 GitHub stars and an active community.

Does candle-vllm support GPU acceleration?

candle-vllm supports GPU acceleration via CUDA, Metal, or Vulkan depending on your platform. For the best performance, pair it with an NVIDIA RTX 4090 or 5090.

What are the best alternatives to candle-vllm?

Popular alternatives include other inference-server tools in our directory. Browse our full collection at /tool for comparisons, community reviews, and benchmark data to find the right fit for your workflow.

How much does candle-vllm cost?

candle-vllm is free-open-source. It is completely free and open source to self-host.

Pairs well with

Complementary tools, models, and hardware

Comments coming soon

Configure NEXT_PUBLIC_GISCUS_REPO_ID and NEXT_PUBLIC_GISCUS_CATEGORY_ID at giscus.app to enable.