SwiftLM
⚡ Native MLX Swift LLM inference server for Apple Silicon. OpenAI-compatible API, SSD streaming for 100B+ MoE models, TurboQuant KV cache compression, MACOS +…
What it does
Core capabilities at a glance
- Apple Sili
- Inference
- IOS
- Metal
- MLX
- MOE
- ON Device AI
- Openai API
Deep dive
The full breakdown - performance, comparisons, and setup
SwiftLM
SwiftLM is a local inference server - ⚡ Native MLX Swift LLM inference server for Apple Silicon. OpenAI-compatible API, SSD streaming for 100B+ MoE models, TurboQuant KV cache compression, MACOS + iOS iPhone app.
Overview
A blazingly fast, native Swift inference server that serves MLX models with a strict OpenAI-compatible API.
No Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copies. Just bare-metal Apple Silicon performance compiled to a single binary.
Download the latest release tarball from the Releases page. The archive is self-contained — 'mlx.metallib' is bundled alongside the binary.
The build script handles everything: submodules, cmake, Metal kernel compilation, and the Swift build.
This will: 1. Initialize git submodules 2. Install 'cmake' via Homebrew (if not already installed) 3. Compile 'mlx.metallib' from the Metal kernel sources 4. Build the 'SwiftLM' binary in release mode
(Add '--stream-experts' when running oversized MoE models to bypass macOS virtual memory swapping and stream expert layers directly from NVMe SSD.)
Benchmarked with 'gemma-4-26b-a4b-it-4bit' running three configurations across 512 / 40K / 100K token contexts.
* Time-weighted average: 'total_tokens / sum(60/TPS)' — correct wall-clock representation vs arithmetic mean.
SwiftLM is open-source, written primarily in Swift, with 685 GitHub stars under the MIT license. The latest release is b648 (2026-05-08).
Key capabilities
From the project's documentation:
- 🚀 1.81× avg throughput — MTP + TurboQuant delivers 66.2 tok/s time-weighted vs 36.6 tok/s baseline
- 🏎️ Nearly 2× faster TTFT at 100K context — 33.95s vs 63.11s baseline (46% reduction)
- 🔬 MTP alone is free — 1.10× time-weighted TPS and lower TTFT with zero additional memory overhead
- Peak physical RAM stays ≤ 17 GB across all configurations — the 126 GB model streams the rest from NVMe SSD.
- 🍎 100% Native Apple Silicon: Powered natively by Metal and Swift.
- 🔌 OpenAI-compatible: Drop-in replacement for OpenAI SDKs (/v1/chat/completions, streaming, etc).
How it fits a local-AI stack
SwiftLM runs on your own hardware, so pair it with a model and a GPU sized to your needs. Use the VRAM calculator to pick a model that fits your card, and see what you can run for hardware guidance. Related local inference servers in the directory:
Sources
- Source code & docs: SharpAI/SwiftLM
Stats from GitHub, 2026-06-08.
Frequently asked
Quick answers to common questions
What is SwiftLM?
SwiftLM is a inference-server tool for local AI workloads. ⚡ Native MLX Swift LLM inference server for Apple Silicon. OpenAI-compatible API, SSD streaming for 100B+ MoE models, TurboQuant KV cache compression, MACOS +…
Is SwiftLM free and open source?
Yes, SwiftLM has 685 GitHub stars and is licensed under MIT. You can self-host it for free on macos.
What platforms does SwiftLM support?
SwiftLM runs on macos.
What hardware do I need for SwiftLM?
The hardware requirements depend on which models you run. Check our hardware directory for compatible GPUs and systems. SwiftLM has 685 GitHub stars and an active community.
Does SwiftLM support GPU acceleration?
SwiftLM supports GPU acceleration via CUDA, Metal, or Vulkan depending on your platform. For the best performance, pair it with an NVIDIA RTX 4090 or 5090.
What are the best alternatives to SwiftLM?
Popular alternatives include other inference-server tools in our directory. Browse our full collection at /tool for comparisons, community reviews, and benchmark data to find the right fit for your workflow.
How much does SwiftLM cost?
SwiftLM is free-open-source. It is completely free and open source to self-host.
Pairs well with
Complementary tools, models, and hardware
Comments coming soon
Configure NEXT_PUBLIC_GISCUS_REPO_ID and NEXT_PUBLIC_GISCUS_CATEGORY_ID at giscus.app to enable.