What it does

Core capabilities at a glance

Codellama
Cuda Kernels
Deepspeed
Fastertransformer
Internlm
Llama
Llama2
Llama3

Deep dive

The full breakdown - performance, comparisons, and setup

lmdeploy

lmdeploy is a local inference server - LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Overview

[2026/04] PyPI has expanded the storage quota for LMDeploy and wheel uploads have resumed. 'v0.12.3' is now available on PyPI, so you can install it directly via 'pip install lmdeploy'. - [2026/02] Support Qwen3.5 - [2026/02] Support vllm-project/llm-compressor 4bit symmetric/asymmetric quantization. Refer here for detailed guide
[2025/09] TurboMind supports MXFP4 on NVIDIA GPUs starting from V100, achieving 1.5x the performmance of vLLM on H800 for openai gpt-oss models! - [2025/06] Comprehensive inference optimization for FP8 MoE Models - [2025/06] DeepSeek PD Disaggregation deployment is now supported through integration with DLSlime and Mooncake. Huge thanks to both teams! - [2025/04] Enhance DeepSeek inference performance by integration deepseek-ai techniques: FlashMLA, DeepGemm, DeepEP, MicroBatch and eplb - [2025/01] Support DeepSeek V3 and R1

lmdeploy is open-source, written primarily in Python, with 7,885 GitHub stars under the Apache 2.0 license. The latest release is v0.13.0 (2026-05-12).

Key capabilities

From the project's documentation:

[2025/06] Comprehensive inference optimization for FP8 MoE Models
[2025/01] Support DeepSeek V3 and R1
[2024/11] Support Mono-InternVL with PyTorch engine
[2024/10] PyTorchEngine supports graph mode on ascend platform, doubling the inference speed
[2024/09] LMDeploy PyTorchEngine achieves 1.3x faster on Llama3-8B inference by introducing CUDA graph
[2024/07] Support Llama3.1 8B, 70B and its TOOLS CALLING

Install

A quick way to get started (always check the official docs for the latest):

pip install lmdeploy

How it fits a local-AI stack

lmdeploy runs on your own hardware, so pair it with a model and a GPU sized to your needs. Use the VRAM calculator to pick a model that fits your card, and see what you can run for hardware guidance. Related local inference servers in the directory:

Sources

Source code & docs: InternLM/lmdeploy
Official website: https://lmdeploy.readthedocs.io/en/latest

Stats from GitHub, 2026-06-08.

Frequently asked

Quick answers to common questions

What is lmdeploy?

lmdeploy is a inference-server tool for local AI workloads. LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Is lmdeploy free and open source?

Yes, lmdeploy has 7,971 GitHub stars and is licensed under Apache 2.0. You can self-host it for free on .

What hardware do I need for lmdeploy?

The hardware requirements depend on which models you run. Check our hardware directory for compatible GPUs and systems. lmdeploy has 7,971 GitHub stars and an active community.

Does lmdeploy support GPU acceleration?

lmdeploy's GPU support depends on your specific setup. Check the documentation for details. For the best performance, pair it with an NVIDIA RTX 4090 or 5090.

What are the best alternatives to lmdeploy?

Popular alternatives include other inference-server tools in our directory. Browse our full collection at /tool for comparisons, community reviews, and benchmark data to find the right fit for your workflow.

How much does lmdeploy cost?

lmdeploy is free-open-source. It is completely free and open source to self-host.

Pairs well with

Complementary tools, models, and hardware

lmdeploy

What it does

Deep dive

lmdeploy

Overview

Key capabilities

Install

How it fits a local-AI stack

Sources

Frequently asked

What is lmdeploy?

Is lmdeploy free and open source?

What hardware do I need for lmdeploy?

Does lmdeploy support GPU acceleration?

What are the best alternatives to lmdeploy?

How much does lmdeploy cost?

Pairs well with

Tools

Models

Hardware