inference-server, 3,122 GitHub stars, MIT license

whichllm

Find the local LLM that actually runs and performs best on your hardware. Ranked by real, recency-aware benchmarks, not parameter count. One command, run it in…

Updated Jun 8, 2026
Platforms
Pricing: free-open-source
Status: active
License: MIT

What it does

Core capabilities at a glance

  • Apple Silicon
  • Benchmarks
  • CLI
  • Command Line Tool
  • GGUF
  • GPU
  • HuggingFace
  • Inference

Deep dive

The full breakdown - performance, comparisons, and setup

whichllm

whichllm is a command-line tool, listed in this directory under inference-server, that finds the local LLM that actually runs and performs best on your hardware. Models are ranked by real, recency-aware benchmarks, not parameter count. One command, run it instantly.

Overview

Find the best local LLM that actually runs on your hardware.

Auto-detects your GPU/CPU/RAM and ranks the top models from HuggingFace that fit your system.

After install, run 'whichllm' directly. For one-off runs, replace 'whichllm' with 'uvx whichllm@latest'.

In the project's example, the 32B model fits the card fine, yet whichllm still ranks the 27B #1, because it scores higher on real benchmarks and is a newer generation. A size-only "what fits?" tool would hand you the bigger one. That gap is the whole point of whichllm. (Note entry #3 in the same example: a MoE model at 102 t/s. Speed is ranked on active params, quality on total params.)

Fitting a model into your VRAM is the easy part. The hard part is knowing which of the models that fit is actually the best — and that is what whichllm is built to get right.
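The difference between "what fits" and "what is best among what fits" can be sketched in a few lines of Python. This is only an illustration of the idea, not whichllm's actual ranking code: the Candidate class, the rank_fitting function, the model names, and every number below are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str        # hypothetical model label
    params_b: float  # total parameters, in billions
    vram_gb: float   # estimated VRAM needed, in GB (made-up figure)
    score: float     # benchmark score, higher is better (made-up figure)


def rank_fitting(candidates, available_vram_gb):
    """Keep only the models that fit, then order them by benchmark score."""
    fitting = [c for c in candidates if c.vram_gb <= available_vram_gb]
    # Sort on score, not on params_b: the largest model that fits is not
    # necessarily the one that performs best.
    return sorted(fitting, key=lambda c: c.score, reverse=True)


candidates = [
    Candidate("example-32b", 32, 20.0, 61.0),
    Candidate("example-27b", 27, 17.0, 68.0),
    Candidate("example-70b", 70, 42.0, 72.0),  # does not fit in 24 GB
]

ranked = rank_fitting(candidates, available_vram_gb=24.0)
print([c.name for c in ranked])  # ['example-27b', 'example-32b']
```

A size-only tool would effectively sort the same filtered list on params_b and put the 32B entry first.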

whichllm is open-source, written primarily in Python, with 3,122 GitHub stars under the MIT license. The latest release is v0.5.8 (2026-06-05).

Key capabilities

From the project's documentation:

  • Evidence-based ranking, not a size heuristic — The top pick is
  • Recency-aware — Stale leaderboards are demoted along each model's
  • Evidence-graded and guarded — Every score is tagged
  • Architecture-aware estimates — VRAM = weights + GQA KV cache +
  • One command, scriptable — whichllm prints the answer; add
  • Live data — Models fetched directly from the HuggingFace API, with
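
The "VRAM = weights + GQA KV cache +" item above is cut off in this listing, so the remaining term is not known here. A rough sketch of the first two terms, using the standard formulas for quantized weight size and a grouped-query-attention KV cache, might look like the following. The function names, the fixed overhead_gb placeholder standing in for the missing term, and the example model shape are all assumptions, not whichllm's implementation.

```python
def weights_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Memory for the quantized weights, in GB (1 GB = 1e9 bytes)."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9


def gqa_kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                    context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB.

    With grouped-query attention only n_kv_heads (not every attention head)
    store keys and values, which is why the cache shrinks.
    """
    # The factor 2 covers one key tensor and one value tensor per layer.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1e9


def estimate_vram_gb(total_params_b, bits_per_weight, n_layers, n_kv_heads,
                     head_dim, context_len, overhead_gb=1.0):
    """Weights + KV cache + an assumed flat overhead (placeholder term)."""
    return (weights_gb(total_params_b, bits_per_weight)
            + gqa_kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len)
            + overhead_gb)


# Hypothetical 27B model at roughly 4.5 bits per weight, 8K context, fp16 cache.
estimate = estimate_vram_gb(27, 4.5, n_layers=48, n_kv_heads=8,
                            head_dim=128, context_len=8192)
print(f"{estimate:.1f} GB")
```

The weights term dominates at short context lengths; the KV cache term grows linearly with context_len, so long contexts can push a model that nominally fits out of VRAM.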

Install

A quick way to get started (always check the official docs for the latest):

pip install whichllm

How it fits a local-AI stack

whichllm runs on your own hardware, so pair it with a model and a GPU sized to your needs. Use the VRAM calculator to pick a model that fits your card, and see what you can run for hardware guidance.

Sources

Stats from GitHub, 2026-06-08.

Frequently asked

Quick answers to common questions

What is whichllm?

whichllm is an inference-server tool for local AI workloads. Find the local LLM that actually runs and performs best on your hardware. Ranked by real, recency-aware benchmarks, not parameter count. One command, run it in…

Is whichllm free and open source?

Yes. whichllm is licensed under MIT and has 3,122 GitHub stars. You can self-host it for free.

What hardware do I need for whichllm?

whichllm auto-detects your GPU, CPU, and RAM and ranks the models that fit, so the hardware requirements depend on which models you want to run. Check our hardware directory for compatible GPUs and systems.

Does whichllm support GPU acceleration?

whichllm auto-detects your GPU and ranks models that fit it; the listing is also tagged for Apple Silicon. Check the documentation for details on your specific setup. A high-VRAM card such as an NVIDIA RTX 4090 or 5090 widens the set of models that fit.

What are the best alternatives to whichllm?

Popular alternatives include other inference-server tools in our directory. Browse our full collection at /tool for comparisons, community reviews, and benchmark data to find the right fit for your workflow.

How much does whichllm cost?

whichllm is free and open source under the MIT license, so it costs nothing to self-host.
