whichllm
Find the local LLM that actually runs and performs best on your hardware. Ranked by real, recency-aware benchmarks, not parameter count. One command, run it in…
What it does
Core capabilities at a glance
- Apple Silicon
- Benchmarks
- CLI
- Command Line Tool
- GGUF
- GPU
- HuggingFace
- Inference
Deep dive
The full breakdown - performance, comparisons, and setup
whichllm is a command-line tool, listed in this directory under local inference servers, that finds the local LLM that actually runs and performs best on your hardware. Models are ranked by real, recency-aware benchmarks, not parameter count. One command, run it instantly.
Overview
Find the best local LLM that actually runs on your hardware.
Auto-detects your GPU/CPU/RAM and ranks the top models from HuggingFace that fit your system.
After install, run 'whichllm' directly. For one-off runs, replace 'whichllm' with 'uvx whichllm@latest'.
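To make the auto-detection step described above more concrete, here is a minimal sketch of probing CPU count, system RAM, and NVIDIA GPUs using only the Python standard library. It is an illustration, not whichllm's actual detection code: the function name `detect_hardware` and the returned dictionary layout are assumptions made for this example, and Apple Silicon and AMD GPUs are not covered.

```python
import os
import shutil
import subprocess


def detect_hardware() -> dict:
    """Best-effort probe of CPU count, system RAM, and NVIDIA GPUs.

    Hypothetical helper for illustration; not part of whichllm.
    """
    info = {"cpu_count": os.cpu_count(), "ram_bytes": None, "gpus": []}

    # Physical RAM via POSIX sysconf; not available on every platform.
    try:
        info["ram_bytes"] = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        pass

    # NVIDIA GPUs via nvidia-smi, only if the binary is on PATH.
    if shutil.which("nvidia-smi"):
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=name,memory.total",
                 "--format=csv,noheader,nounits"],
                capture_output=True, text=True, timeout=10, check=True,
            )
        except (subprocess.SubprocessError, OSError):
            return info
        for line in result.stdout.strip().splitlines():
            # Each line looks like "<gpu name>, <memory in MiB>".
            name, _, mem_mib = line.rpartition(",")
            try:
                info["gpus"].append({"name": name.strip(), "vram_mib": int(mem_mib)})
            except ValueError:
                continue  # skip rows with a non-numeric memory field

    return info


if __name__ == "__main__":
    print(detect_hardware())
```

On a machine without `nvidia-smi`, the `gpus` list simply stays empty, which is the case a ranking tool would treat as CPU-only.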
In the project's example, the 32B model fits the card fine, yet whichllm still ranks the 27B #1 because it scores higher on real benchmarks and is a newer generation. A size-only "what fits?" tool would hand you the bigger one. That gap is the whole point of whichllm. (Note the #3 entry: a MoE model at 102 t/s. Speed is ranked on active params, quality on total.)
Fitting a model into your VRAM is the easy part. The hard part is knowing which of the models that fit is actually the best — and that is what whichllm is built to get right.
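The difference between "biggest model that fits" and "best model that fits" can be sketched in a few lines of Python. Everything here is hypothetical: the model names, VRAM figures, and scores are made-up illustration data, and the functions are not whichllm's real ranking logic. The sketch only shows the shape of the idea, including the MoE point that speed tracks active parameters.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str               # hypothetical model label
    total_params_b: float   # total parameters, in billions
    active_params_b: float  # parameters used per token (equals total for dense models)
    vram_gb: float          # estimated VRAM requirement (made-up)
    score: float            # benchmark score, higher is better (made-up)


def fitting(candidates, vram_budget_gb):
    """Keep only the models whose estimated VRAM need fits the budget."""
    return [c for c in candidates if c.vram_gb <= vram_budget_gb]


def rank_by_evidence(candidates, vram_budget_gb):
    """Rank the models that fit by benchmark score, best first."""
    return sorted(fitting(candidates, vram_budget_gb),
                  key=lambda c: c.score, reverse=True)


def size_only_pick(candidates, vram_budget_gb):
    """What a size-only tool would do: take the largest model that fits."""
    return max(fitting(candidates, vram_budget_gb),
               key=lambda c: c.total_params_b)


def fastest_first(candidates, vram_budget_gb):
    """Rough speed proxy: fewer active parameters per token means faster decoding."""
    return sorted(fitting(candidates, vram_budget_gb),
                  key=lambda c: c.active_params_b)


models = [
    Candidate("dense-32b", 32, 32, 20.0, 60.0),
    Candidate("dense-27b", 27, 27, 17.0, 70.0),
    Candidate("moe-30b-a3b", 30, 3, 19.0, 65.0),
    Candidate("dense-70b", 70, 70, 42.0, 75.0),
]

ranked = rank_by_evidence(models, vram_budget_gb=24.0)
print([c.name for c in ranked])           # ['dense-27b', 'moe-30b-a3b', 'dense-32b']
print(size_only_pick(models, 24.0).name)  # dense-32b
print(fastest_first(models, 24.0)[0].name)  # moe-30b-a3b
```

With a 24 GB budget, the 70B model is filtered out, the size-only pick is the 32B, and the score-based ranking puts the 27B first, mirroring the example above.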
whichllm is open-source, written primarily in Python, with 3,122 GitHub stars under the MIT license. The latest release is v0.5.8 (2026-06-05).
Key capabilities
From the project's documentation:
- Evidence-based ranking, not a size heuristic — The top pick is
- Recency-aware — Stale leaderboards are demoted along each model's
- Evidence-graded and guarded — Every score is tagged
- Architecture-aware estimates — VRAM = weights + GQA KV cache +
- One command, scriptable — whichllm prints the answer; add
- Live data — Models fetched directly from the HuggingFace API, with
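The architecture-aware estimate in the list above names two terms, weights and a GQA KV cache, before the excerpt is cut off. A minimal sketch of just those two terms follows; the remaining terms are truncated in the source and are deliberately left out rather than guessed. The function names and the example model shape (8B parameters, 32 layers, 8 KV heads, head dimension 128) are assumptions for illustration, not values taken from whichllm.

```python
def weights_bytes(n_params: float, bits_per_weight: float) -> float:
    """Memory for the model weights at a given quantization width."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size under grouped-query attention (GQA).

    The leading 2 accounts for one K and one V tensor per layer. With GQA,
    only n_kv_heads (not the full number of query heads) are cached.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def partial_vram_gib(n_params, bits_per_weight, n_layers, n_kv_heads,
                     head_dim, context_len) -> float:
    """Weights + KV cache only, in GiB. Other overheads are NOT included."""
    total = (weights_bytes(n_params, bits_per_weight)
             + kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len))
    return total / 2**30


# Hypothetical 8B model, ~4.5 bits per weight, 8192-token context, fp16 cache.
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
print(kv / 2**30)  # 1.0 (GiB)
print(round(partial_vram_gib(8e9, 4.5, 32, 8, 128, 8192), 2))
```

If the same hypothetical model cached all 32 heads instead of 8 KV heads, the cache term would be four times larger, which is why an estimate that ignores GQA can badly overstate memory needs at long context.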
Install
A quick way to get started (always check the official docs for the latest):
pip install whichllm
How it fits a local-AI stack
whichllm runs on your own hardware, so pair it with a model and a GPU sized to your needs. Use the VRAM calculator to pick a model that fits your card, and see what you can run for hardware guidance.
Sources
- Source code & docs: Andyyyy64/whichllm
Stats from GitHub, 2026-06-08.
Frequently asked
Quick answers to common questions
What is whichllm?
whichllm is an inference-server tool for local AI workloads. Find the local LLM that actually runs and performs best on your hardware. Ranked by real, recency-aware benchmarks, not parameter count. One command, run it in…
Is whichllm free and open source?
Yes. whichllm is open source under the MIT license and has 3,122 GitHub stars. You can self-host it for free.
What hardware do I need for whichllm?
The hardware requirements depend on which models you run. whichllm auto-detects your GPU, CPU, and RAM and ranks the models that fit your system. Check our hardware directory for compatible GPUs and systems.
Does whichllm support GPU acceleration?
whichllm auto-detects your GPU and ranks the models that fit it; which GPUs are supported depends on your specific setup, so check the documentation for details. For the best performance, pair it with an NVIDIA RTX 4090 or 5090.
What are the best alternatives to whichllm?
Popular alternatives include other inference-server tools in our directory. Browse our full collection at /tool for comparisons, community reviews, and benchmark data to find the right fit for your workflow.
How much does whichllm cost?
whichllm is free and open source under the MIT license, so there is no cost to self-host it.