Rapid-MLX social preview
inference-server2,691Apache 2.0

Rapid-MLX

The fastest local AI engine for Apple Silicon. 4.2x faster than Ollama, 0.08s cached TTFT, 100% tool calling. 17 tool parsers, prompt cache, reasoning separati…

Updated Jun 8, 2026
Platforms
macos
Pricing
free-open-source
Status
active
License
Apache 2.0

What it does

Core capabilities at a glance

  • Apple Silicon
  • Claude Code
  • Cursor
  • Deepseek
  • Fastapi
  • Inference
  • Local LLM
  • M1

Deep dive

The full breakdown - performance, comparisons, and setup

Rapid-MLX

Rapid-MLX is a local inference server - The fastest local AI engine for Apple Silicon. 4.2x faster than Ollama, 0.08s cached TTFT, 100% tool calling. 17 tool parsers, prompt cache, reasoning separation, cloud routing. Drop-in OpenAI replacement. Works with Claude Code, Cursor, Aider.

Overview

Run local AI models on your Mac — no cloud, no API costs. Works with Cursor, Claude Code, and any OpenAI-compatible app.

pip install → serve Gemma 4 26B → chat + tool calling → works with PydanticAI, LangChain, Aider, and more.

Single-user end-to-end throughput (B=1: one request at a time, 256 max output tokens, 'output_tokens / wall-clock' incl. first-token latency), median of 3 rounds. 'chat_template_kwargs.enable_thinking=False' passed where the engine honours it. Tested on M3 Ultra 256 GB / rapid-mlx v0.6.80. ¹ carried over from 2026-04 bench — disk-constrained on this refresh.

Rapid-MLX is open-source, written primarily in Python, with 2,690 GitHub stars under the Apache 2.0 license. The latest release is v0.6.82 (2026-06-07).

Key capabilities

From the project's documentation:

  • tok/s (tokens per second) — roughly how many words the AI generates per second. Higher = faster.
  • 4bit / 8bit — compression levels for models. 4bit uses less memory (recommended); 8bit is higher quality.
  • TTFT (Time To First Token) — how long before the AI starts responding.
  • Tool calling — the AI can call functions in your code. Used by Cursor, Claude Code, and coding assistants.
  • Subcommand names (serve / chat / agents / bench / doctor)
  • Model alias names (qwen3.5-9b) or canonical HF repo IDs (mlx-community/...) — local paths are redacted to

Install

A quick way to get started (always check the official docs for the latest):

pip install rapid-mlx

How it fits a local-AI stack

Rapid-MLX runs on your own hardware, so pair it with a model and a GPU sized to your needs. Use the VRAM calculator to pick a model that fits your card, and see what you can run for hardware guidance. Related local inference servers in the directory:

Sources

Stats from GitHub, 2026-06-08.

Frequently asked

Quick answers to common questions

What is Rapid-MLX?

Rapid-MLX is a inference-server tool for local AI workloads. The fastest local AI engine for Apple Silicon. 4.2x faster than Ollama, 0.08s cached TTFT, 100% tool calling. 17 tool parsers, prompt cache, reasoning separati…

Is Rapid-MLX free and open source?

Yes, Rapid-MLX has 2,691 GitHub stars and is licensed under Apache 2.0. You can self-host it for free on macos.

What platforms does Rapid-MLX support?

Rapid-MLX runs on macos.

What hardware do I need for Rapid-MLX?

The hardware requirements depend on which models you run. Check our hardware directory for compatible GPUs and systems. Rapid-MLX has 2,691 GitHub stars and an active community.

Does Rapid-MLX support GPU acceleration?

Rapid-MLX supports GPU acceleration via CUDA, Metal, or Vulkan depending on your platform. For the best performance, pair it with an NVIDIA RTX 4090 or 5090.

What are the best alternatives to Rapid-MLX?

Popular alternatives include other inference-server tools in our directory. Browse our full collection at /tool for comparisons, community reviews, and benchmark data to find the right fit for your workflow.

How much does Rapid-MLX cost?

Rapid-MLX is free-open-source. It is completely free and open source to self-host.

Pairs well with

Complementary tools, models, and hardware

Comments coming soon

Configure NEXT_PUBLIC_GISCUS_REPO_ID and NEXT_PUBLIC_GISCUS_CATEGORY_ID at giscus.app to enable.