vLLM vs TensorRT-LLM vs HF TGI vs LMDeploy: A Deep Technical Comparison for Production LLM Inference

Production LLM serving is now a systems problem, not a generate() loop. For real workloads, the choice of inference stack drives your tokens per second, tail latency, and ultimately cost per million tokens on a given GPU fleet.
This comparison focuses on 4 widely used stacks: vLLM, TensorRT-LLM, Hugging Face Text Generation Inference (TGI), and LMDeploy.
vLLM
Core idea
vLLM is built around PagedAttention, an attention implementation that treats the KV cache like paged virtual memory rather than a single contiguous buffer per sequence.
Instead of allocating one big KV region per request, vLLM splits the KV cache into fixed size blocks of a few tokens each, allocates blocks on demand as a sequence grows, tracks them through a per sequence block table that maps logical blocks to physical blocks, and can share blocks across sequences that overlap, for example during parallel sampling, using copy on write.
This reduces external fragmentation and lets the scheduler pack many more concurrent sequences into the same VRAM.
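As a rough mental model, the block table bookkeeping looks something like the toy sketch below; the block size, pool size, and class names are illustrative and not vLLM's internal code.

```python
# Toy illustration of paged KV-cache allocation (not vLLM's real implementation).
# Physical blocks come from one shared pool; each sequence keeps a block table
# mapping its logical blocks to physical block ids.

BLOCK_SIZE = 16  # tokens per KV block

class PagedKVCacheManager:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: int, seq_len: int) -> None:
        """Allocate a new physical block only when a sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if seq_len % BLOCK_SIZE == 1:  # first token of a new block
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; scheduler must preempt or queue")
            table.append(self.free_blocks.pop())

    def free_sequence(self, seq_id: int) -> None:
        """Return all of a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))


mgr = PagedKVCacheManager(num_physical_blocks=1024)
for step in range(1, 40):          # decode 39 tokens for sequence 0
    mgr.append_token(seq_id=0, seq_len=step)
print(len(mgr.block_tables[0]))    # 3 blocks used, not a max-length reservation
```

Because unused blocks stay in the shared pool, a long-context request and dozens of short chats can coexist without reserving worst case memory for each.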
Throughput and latency
The vLLM paper reports 2–4× higher throughput than systems like FasterTransformer and Orca at similar latency, with the largest gains on longer sequences.
Key properties for operators include continuous batching of incoming requests, chunked prefill, optional automatic prefix caching, tensor and pipeline parallelism for multi GPU serving, and support for common quantization formats such as GPTQ, AWQ, and FP8.
vLLM exposes an OpenAI compatible HTTP API and integrates well with Ray Serve and other orchestrators, which is why it is widely used as an open baseline.
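Because the server speaks the OpenAI protocol, the stock openai client can be pointed at it directly; the port and model name below are placeholders for whatever was passed to vllm serve.

```python
# Minimal client for a vLLM OpenAI-compatible server, e.g. started with:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# Endpoint, port, and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in two sentences."}],
    max_tokens=128,
    temperature=0.2,
)
print(resp.choices[0].message.content)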
KV and multi tenant
Prefix caching lets requests that share a long system prompt or document reuse KV blocks instead of recomputing them. Within one server process vLLM serves a single base model, optionally with multiple LoRA adapters, so multi model and multi tenant routing is handled by an external layer such as Ray Serve or a gateway in front of several vLLM instances.
TensorRT-LLM
Core idea
TensorRT-LLM is NVIDIA’s optimized inference library for its GPUs. The library provides custom attention kernels, inflight batching, paged KV caching, quantization down to FP4 and INT4, and speculative decoding.
It is tightly coupled to NVIDIA hardware, including FP8 tensor cores on Hopper and Blackwell.
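For scripted or offline use, recent TensorRT-LLM releases also ship a high level Python LLM API; the sketch below assumes that API and a placeholder model id, and omits engine build and tuning knobs entirely.

```python
# Sketch of offline generation with TensorRT-LLM's high-level LLM API
# (assumes a recent TensorRT-LLM release; the model id is a placeholder).
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # builds/loads an engine for this GPU
params = SamplingParams(max_tokens=64, temperature=0.2)

for out in llm.generate(["Summarize inflight batching in one sentence."], params):
    print(out.outputs[0].text)
```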
Measured performance
NVIDIA’s H100 vs A100 evaluation is the most concrete public reference:
For latency sensitive modes:
These numbers are model and shape specific, but they give a realistic scale.
Prefill vs decode
TensorRT-LLM optimizes both phases: prefill with fused attention kernels, chunked context processing, and FP8 paths on Hopper and Blackwell tensor cores, and decode with inflight batching, the paged KV cache, and optional speculative decoding, so generation steps stay dense even as requests enter and leave the batch.
The result is very high tokens/s across a wide range of input and output lengths, especially when the engine is tuned for that model and batch profile.
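A useful back of the envelope check is to model a request as one parallel prefill pass plus a sequential decode loop; the rates in this sketch are made-up placeholders to be replaced with throughput measured on your own engine and GPU.

```python
# Rough request-latency model: prefill is processed in parallel over the prompt,
# decode is sequential per output token.
# The rates below are hypothetical placeholders, not measured TensorRT-LLM numbers.

def estimate_latency_s(prompt_tokens: int, output_tokens: int,
                       prefill_tok_per_s: float, decode_tok_per_s: float) -> float:
    prefill_time = prompt_tokens / prefill_tok_per_s
    decode_time = output_tokens / decode_tok_per_s
    return prefill_time + decode_time

# Example: 6k-token RAG prompt, 300-token answer, hypothetical rates.
print(round(estimate_latency_s(6000, 300,
                               prefill_tok_per_s=20_000,
                               decode_tok_per_s=120), 2), "seconds")
```

Splitting the estimate this way makes it obvious whether a workload is prefill-dominated (long prompts, short answers) or decode-dominated (short prompts, long generations), which is exactly the profile an engine should be tuned for.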
KV and multi tenant
TensorRT-LLM provides a paged KV cache, KV cache reuse for requests that share a common prefix, optional KV cache quantization, and inflight batching that keeps concurrent requests from many tenants packed onto one engine.
NVIDIA pairs this with Ray based or Triton based orchestration patterns for multi tenant clusters. Multi model support is done at the orchestrator level, not inside a single TensorRT-LLM engine instance.
Hugging Face TGI
Core idea
Text Generation Inference (TGI) is a Rust and Python based serving stack that adds continuous batching, token streaming, prefix caching, quantization support, guided decoding, and production telemetry such as Prometheus metrics and OpenTelemetry tracing on top of Hugging Face model servers.
Version 3 focuses on long prompt processing through chunking and prefix caching.
Long prompt benchmark vs vLLM
The TGI v3 docs give a clear benchmark:
The mechanism is chunked prefill plus prefix caching: the long prompt is processed in chunks, its KV cache is retained and keyed on the prefix, and subsequent turns that share that prefix only pay prefill for the newly appended tokens.
This is a targeted optimization for workloads where prompts are extremely long and reused across turns, for example RAG pipelines and analytic summarization.
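A toy sketch of the prefix cache idea, not TGI's actual data structures: prompt chunks are keyed by a hash of everything up to that chunk, so a follow up turn only pays prefill for the suffix that is not already cached.

```python
# Toy prefix cache keyed by a rolling hash of prompt chunks
# (illustrative only, not TGI's internal implementation).
import hashlib

CHUNK = 256  # tokens per cached chunk

def chunk_keys(tokens: list[int]) -> list[str]:
    """One cache key per chunk, covering all tokens up to that chunk's end."""
    keys, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % CHUNK, CHUNK):
        h.update(str(tokens[i:i + CHUNK]).encode("utf-8"))
        keys.append(h.copy().hexdigest())
    return keys

cache: set[str] = set()

def prefill_cost(tokens: list[int]) -> int:
    """Tokens that still need prefill after reusing the longest cached prefix."""
    keys = chunk_keys(tokens)
    reused = 0
    for k in keys:                      # stop at the first uncached chunk
        if k in cache:
            reused += CHUNK
        else:
            break
    cache.update(keys)                  # remember everything we just computed
    return len(tokens) - reused

turn1 = list(range(200_000))            # long document plus question
turn2 = turn1 + list(range(50))         # follow-up question appended
print(prefill_cost(turn1))              # full prefill on the first turn
print(prefill_cost(turn2))              # only the short new suffix on the second turn
```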
Architecture and latency behavior
Key components are a Rust router that validates requests, manages queues, and forms continuous batches, and Python model server shards that execute the forward passes and stream tokens back over gRPC.
For short chat style workloads, throughput and latency are in the same ballpark as vLLM. For long, cacheable contexts, both P50 and P99 latency improve by an order of magnitude because the engine avoids repeated prefill.
Multi backend and multi model
TGI is designed as a router plus model server architecture. It can route requests across multiple model replicas, expose both its native generate API and an OpenAI compatible Messages API, and delegate execution to alternative backends such as TensorRT-LLM behind the same frontend.
This makes it suitable as a central serving tier in multi tenant environments.
LMDeploy
Core idea
LMDeploy from the InternLM ecosystem is a toolkit for compressing and serving LLMs, centered around the TurboMind engine. It focuses on persistent (continuous) batching, a blocked KV cache, weight only quantization such as AWQ W4A16, KV cache quantization, and hand optimized CUDA kernels, with a PyTorch engine as a fallback for broader model coverage.
Relative throughput vs vLLM
The project states that TurboMind delivers up to 1.8x higher request throughput than vLLM, attributing this to persistent batching, the blocked KV cache, dynamic split and fuse, tensor parallelism, and optimized CUDA kernels.
KV, quantization and latency
LMDeploy includes a blocked KV cache, int8 and int4 KV cache quantization, W4A16 weight only quantization via AWQ, and W8A8 quantization on the PyTorch engine, all aimed at fitting bigger models and longer contexts into limited VRAM without giving up decode speed.
This makes LMDeploy attractive when you want to run larger open models like InternLM or Qwen on mid range GPUs with aggressive compression while still maintaining good tokens/s.
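A minimal sketch of that setup, assuming LMDeploy's Python pipeline API and TurboMind engine config; the model id and quantization settings are illustrative placeholders.

```python
# Sketch of serving a quantized open model with LMDeploy's TurboMind engine
# (assumes the lmdeploy `pipeline` API; model id and settings are placeholders).
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    model_format="awq",         # 4-bit AWQ weights
    quant_policy=8,             # int8 KV cache to shrink per-token KV memory
    cache_max_entry_count=0.8,  # fraction of free VRAM reserved for KV blocks
    tp=1,                       # tensor parallel degree
)

pipe = pipeline("Qwen/Qwen2.5-7B-Instruct-AWQ", backend_config=engine_cfg)
print(pipe(["Give one reason KV cache quantization helps batch size."])[0].text)
```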
Multi model deployments
LMDeploy provides a proxy server able to handle multiple models and multiple server instances behind a single endpoint, with request routing and load balancing across the api_server workers.
So architecturally it sits closer to TGI than to a single engine.
In practice, many teams mix these systems, for example using TensorRT-LLM for high volume proprietary chat, TGI v3 for long context analytics, and vLLM or LMDeploy for experimental and open model workloads. The key is to align throughput, latency tails, and KV behavior with the actual token distributions in your traffic, then compute cost per million tokens from measured tokens/s on your own hardware.
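The cost arithmetic itself is mechanical once you have a measured sustained tokens/s figure; the GPU price and throughput below are placeholders to replace with your own numbers.

```python
# Cost per million generated tokens from measured sustained throughput.
# Hourly GPU price and tokens/s are placeholders; plug in your own measurements.

def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# Example: a $2.50/hour GPU sustaining 2,400 tok/s aggregated across the batch.
print(f"${cost_per_million_tokens(2.50, 2400):.3f} per 1M tokens")
```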
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
