TensorRT-LLM
inference-server13,821Apache-2.0

TensorRT-LLM

NVIDIA's optimized LLM inference engine delivering maximum performance on NVIDIA GPUs.

Updated Jun 7, 2026
Platforms
linux
Pricing
free-open-source
Status
active
License
Apache-2.0

What it does

Core capabilities at a glance

  • FP8, INT4, INT8 quantization with minimal quality loss
  • In-flight batching for maximum throughput
  • Multi-GPU and multi-node tensor parallelism
  • PagedAttention and KV cache management
  • OpenAI-compatible API server
  • Optimized for Llama, Falcon, Mistral, Gemma, Qwen

Deep dive

The full breakdown - performance, comparisons, and setup

TensorRT-LLM

TensorRT-LLM is NVIDIA's official LLM inference optimization library. It delivers the highest possible throughput on NVIDIA hardware through aggressive kernel optimization and quantization.

What it is

TRT-LLM is an open-source library that optimizes LLM inference on NVIDIA GPUs. It compiles models into optimized TensorRT engines with FP8, INT4, and INT8 quantization, in-flight batching, and multi-GPU parallelism.

Get started

pip install tensorrt_llm
# Build engine
trtllm-build --model_dir Qwen/Qwen3-8B \
  --dtype float16 --use_gpt_attention_plugin float16
# Run server
python examples/launch_triton_server.py \
  --model_repo trtllm_qwen_engine

When to use something else

  • Non-NVIDIA hardware: use llama.cpp or vLLM
  • Easier setup: vLLM has simpler installation
  • Beginner: Ollama for zero-config

Frequently asked

Quick answers to common questions

What is TensorRT-LLM?

TensorRT-LLM is a inference-server tool for local AI workloads. NVIDIA's optimized LLM inference engine delivering maximum performance on NVIDIA GPUs.

Is TensorRT-LLM free and open source?

Yes, TensorRT-LLM has 13,821 GitHub stars and is licensed under Apache-2.0. You can self-host it for free on linux.

What platforms does TensorRT-LLM support?

TensorRT-LLM runs on linux.

What hardware do I need for TensorRT-LLM?

The hardware requirements depend on which models you run. Check our hardware directory for compatible GPUs and systems. TensorRT-LLM has 13,821 GitHub stars and an active community.

Does TensorRT-LLM support GPU acceleration?

TensorRT-LLM supports GPU acceleration via CUDA, Metal, or Vulkan depending on your platform. For the best performance, pair it with an NVIDIA RTX 4090 or 5090.

What are the best alternatives to TensorRT-LLM?

Popular alternatives include other inference-server tools in our directory. Browse our full collection at /tool for comparisons, community reviews, and benchmark data to find the right fit for your workflow.

How much does TensorRT-LLM cost?

TensorRT-LLM is free-open-source. It is completely free and open source to self-host.

Pairs well with

Complementary tools, models, and hardware

Comments coming soon

Configure NEXT_PUBLIC_GISCUS_REPO_ID and NEXT_PUBLIC_GISCUS_CATEGORY_ID at giscus.app to enable.