What it does
Core capabilities at a glance
- Ascend
- Cuda
- Deepseek
- Distributed Inference
- Genai
- High Performance Inference
- Inference
- Llama
Deep dive
The full breakdown - performance, comparisons, and setup
gpustack
gpustack is a local inference server - A GPU cluster manager that configures and orchestrates inference engines like vLLM and SGLang for high-performance AI model deployment.
Overview
GPUStack is an open-source GPU cluster manager designed for efficient AI model deployment. It configures and orchestrates inference engines — vLLM, SGLang, TensorRT-LLM, or your own — to optimize performance across GPU clusters. Its core features include: - Multi-Cluster GPU Management. Manages GPU clusters across multiple environments. This includes on-premises servers, Kubernetes clusters, and cloud providers. - Pluggable Inference Engines. Automatically configures high-performance inference engines such as vLLM, SGLang, and TensorRT-LLM. You can also add custom inference engines as needed. - Day 0 Model Support. GPUStack's pluggable engine architecture enables you to deploy new models on the day they are released. - Performance-Optimized Configurations. Offers pre-tuned modes for low latency or high throughput. GPUStack supports extended KV cache systems like LMCache and HiCache to reduce TTFT. It also includes built-in support for speculative decoding methods such as EAGLE3, MTP, and N-grams. - Enterprise-Grade Operations. Offers support for automated failure recovery, load balancing, monitoring, authentication, and access control.
GPUStack enables development teams, IT organizations, and service providers to deliver Model-as-a-Service at scale. It supports industry-standard APIs for LLM, voice, image, and video models. The platform includes built-in user authentication and access control, real-time monitoring of GPU performance and utilization, and detailed metering of token usage and API request rates.
gpustack is open-source, written primarily in Python, with 5,118 GitHub stars under the Apache 2.0 license. The latest release is v2.1.2 (2026-04-21).
Install
A quick way to get started (always check the official docs for the latest):
docker run -d --name gpustack \How it fits a local-AI stack
gpustack runs on your own hardware, so pair it with a model and a GPU sized to your needs. Use the VRAM calculator to pick a model that fits your card, and see what you can run for hardware guidance. Related local inference servers in the directory:
Sources
- Source code & docs: gpustack/gpustack
- Official website: https://gpustack.ai
Stats from GitHub, 2026-06-08.
Frequently asked
Quick answers to common questions
What is gpustack?
gpustack is a inference-server tool for local AI workloads. A GPU cluster manager that configures and orchestrates inference engines like vLLM and SGLang for high-performance AI model deployment.
Is gpustack free and open source?
Yes, gpustack has 5,119 GitHub stars and is licensed under Apache 2.0. You can self-host it for free on docker.
What platforms does gpustack support?
gpustack runs on docker.
What hardware do I need for gpustack?
The hardware requirements depend on which models you run. Check our hardware directory for compatible GPUs and systems. gpustack has 5,119 GitHub stars and an active community.
Does gpustack support GPU acceleration?
gpustack's GPU support depends on your specific setup. Check the documentation for details. For the best performance, pair it with an NVIDIA RTX 4090 or 5090.
What are the best alternatives to gpustack?
Popular alternatives include other inference-server tools in our directory. Browse our full collection at /tool for comparisons, community reviews, and benchmark data to find the right fit for your workflow.
How much does gpustack cost?
gpustack is free-open-source. It is completely free and open source to self-host.
Pairs well with
Complementary tools, models, and hardware
Comments coming soon
Configure NEXT_PUBLIC_GISCUS_REPO_ID and NEXT_PUBLIC_GISCUS_CATEGORY_ID at giscus.app to enable.