Comparing the Top 6 Inference Runtimes for LLM Serving in 2025

Large language models are now limited less by training and more by how fast and cheaply we can serve tokens under real traffic. That comes down to three implementation details: how the runtime batches requests, how it overlaps prefill and decode, and how it stores and reuses the KV cache. Different engines make different tradeoffs on these axes, which show up directly as differences in tokens per second, P50/P99 latency, and GPU memory usage.
This article compares six runtimes that show up repeatedly in production stacks: vLLM, TensorRT-LLM, Hugging Face Text Generation Inference (TGI), LMDeploy, SGLang, and DeepSpeed.
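Before looking at each engine, it helps to make the KV cache pressure concrete. The sketch below is a back-of-envelope sizing for a hypothetical 7B-class decoder with grouped-query attention; all dimensions are assumptions chosen for illustration, not measurements of any specific runtime.

```python
# Rough KV cache sizing for a hypothetical 7B-class decoder (illustrative numbers).
# Bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes per element.

def kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

per_token = kv_bytes_per_token()          # FP16 KV with grouped-query attention
per_request = per_token * 4096            # one request with a 4K-token context
print(f"{per_token / 1024:.0f} KiB per token, {per_request / 2**20:.0f} MiB per 4K-token request")
print(f"{64 * per_request / 2**30:.1f} GiB of KV for 64 such requests in flight")
```

At 64 concurrent 4K-token requests, the KV cache alone reaches tens of gigabytes, which is exactly the resource the six runtimes below manage in different ways.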
vLLM is built around PagedAttention. Instead of storing each sequence’s KV cache in a large contiguous buffer, it partitions KV into fixed size blocks and uses an indirection layer so each sequence points to a list of blocks.
This gives near-zero internal fragmentation and lets sequences share KV blocks, for example a common prompt prefix, through copy-on-write.
Recent versions add KV quantization (FP8) and integrate FlashAttention style kernels.
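The sketch below illustrates the block-table indirection behind this design. It is a toy model of the idea, not vLLM's actual implementation, and the block and pool sizes are arbitrary.

```python
# Toy sketch of PagedAttention-style KV block management (not vLLM's real code).
BLOCK_SIZE = 16  # tokens per KV block

class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))      # indices into one big physical KV tensor

    def allocate(self):
        if not self.free:
            raise MemoryError("KV pool exhausted; the scheduler must preempt or swap")
        return self.free.pop()

    def release(self, block_ids):
        self.free.extend(block_ids)

class Sequence:
    def __init__(self, pool):
        self.pool = pool
        self.block_table = []                    # indirection: logical blocks -> physical blocks
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is allocated only when the current one fills up.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

pool = BlockPool(num_blocks=1024)
seq = Sequence(pool)
for _ in range(40):
    seq.append_token()
print(len(seq.block_table), "blocks for", seq.num_tokens, "tokens")  # 3 blocks for 40 tokens
```

Because allocation happens one small block at a time, memory is reserved only for tokens that actually exist, and two sequences can point their block tables at the same physical blocks for a shared prefix.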
vLLM evaluation:
Where it fits
TensorRT-LLM is a compilation-based engine on top of NVIDIA TensorRT. It generates fused kernels per model and shape, and exposes an executor API used by frameworks such as Triton.
Its KV subsystem is explicit and feature rich: a paged KV cache, INT8/FP8 KV cache quantization, and KV cache reuse, including offloading reusable blocks to CPU memory.
NVIDIA reports that CPU based KV reuse can reduce time to first token by up to 14× on H100 and even more on GH200 in specific scenarios.
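As a rough illustration of why reuse moves time to first token so much, the estimate below assumes prefill cost scales with the number of prompt tokens that still need processing; the token counts and per-token cost are invented for the example, not NVIDIA's benchmark configuration.

```python
# Back-of-envelope TTFT saving from KV cache reuse (illustrative numbers only).
prompt_tokens = 8000           # e.g. a long system prompt plus retrieved context
reused_prefix = 7500           # tokens whose KV blocks are already cached
prefill_us_per_token = 50      # assumed per-token prefill cost in microseconds

ttft_cold = prompt_tokens * prefill_us_per_token / 1e6
ttft_warm = (prompt_tokens - reused_prefix) * prefill_us_per_token / 1e6
print(f"cold TTFT ~{ttft_cold:.2f} s, warm TTFT ~{ttft_warm:.3f} s, "
      f"speedup ~{ttft_cold / ttft_warm:.0f}x")
# Only the 500-token suffix is prefilled on the warm path, which is where
# double-digit TTFT reductions on repeated prefixes come from.
```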
TensorRT-LLM is highly tunable, so results vary. Common patterns from public comparisons and vendor benchmarks:
Where it fits
Text Generation Inference (TGI) is Hugging Face's server-focused stack: a Rust-based router that handles continuous batching and streaming, with Python model shards running the actual forward passes.
TGI v3 adds a new long-context pipeline: chunked prefill plus prefix caching, so long conversations keep their processed context between turns and only the new tokens are prefilled.
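Since TGI is consumed as a server, the snippet below shows a minimal client call against its documented /generate route; the host, port, and parameter values are assumptions for a locally running instance.

```python
# Minimal TGI client call (assumes a TGI server reachable at localhost:8080).
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Explain KV cache paging in one sentence.",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])   # TGI returns the completion under "generated_text"
```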
For conventional prompts, recent third party work shows:
Latency profile:
Where it fits
LMDeploy is a toolkit for compression and deployment from the InternLM ecosystem. It exposes two engines: TurboMind, a highly optimized CUDA engine, and a PyTorch engine for broader model coverage.
Key runtime features: persistent (continuous) batching, a blocked KV cache, dynamic split-and-fuse, tensor parallelism, and weight-only plus KV cache quantization.
The LMDeploy team reports up to 1.8× higher request throughput than vLLM, attributing this to persistent batching, the blocked KV cache, and optimized CUDA kernels.
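For orientation, here is what a minimal invocation looks like through LMDeploy's high-level pipeline API as documented; the model ID and prompt are placeholders, and the exact defaults (engine selection, generation settings) depend on your install.

```python
# Minimal LMDeploy sketch using the high-level pipeline API (model ID is a placeholder).
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2_5-7b-chat")   # TurboMind engine is picked when supported
responses = pipe(["Summarize why blocked KV caches reduce memory fragmentation."])
print(responses[0].text)
```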
Evaluations show:
Latency:
Where it fits
SGLang is both a frontend DSL for writing structured LLM programs (branching, tool calls, multi-turn control flow) and a serving runtime built around aggressive KV reuse.
RadixAttention: the runtime keeps the KV cache of finished and running requests in a radix tree keyed by token prefixes, so any new request that shares a prefix with earlier traffic reuses those KV blocks instead of recomputing the prefill.
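The toy trie below illustrates the prefix-matching idea; SGLang's actual radix tree stores token runs per edge and manages eviction, which this sketch ignores.

```python
# Toy prefix tree over token IDs to illustrate RadixAttention-style KV reuse
# (a simplified trie, not SGLang's actual radix tree implementation).

class PrefixNode:
    def __init__(self):
        self.children = {}       # token id -> PrefixNode
        self.kv_block = None     # handle to the cached KV for the path ending here

def insert(root, tokens):
    node = root
    for t in tokens:
        node = node.children.setdefault(t, PrefixNode())
        node.kv_block = object()         # stand-in for this token's cached KV

def match_prefix(root, tokens):
    """Return how many leading tokens already have cached KV."""
    node, matched = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node = node.children[t]
        matched += 1
    return matched

root = PrefixNode()
system_prompt = [101, 7, 42, 42, 9]        # tokens of a shared system prompt
insert(root, system_prompt + [13, 55])     # first request, fully prefilled and cached

new_request = system_prompt + [99, 3]      # same system prompt, different user turn
cached = match_prefix(root, new_request)
print(f"reuse KV for {cached} tokens, prefill only {len(new_request) - cached}")  # 5 and 2
```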
Key Insights:
Reported KV cache hit rates range from roughly 50% to 99%, and cache aware schedulers get close to the optimal hit rate on the measured benchmarks.
Where it fits
DeepSpeed provides two pieces relevant for inference: DeepSpeed Inference, with optimized kernels and tensor-parallel execution, and ZeRO Inference, which offloads model weights to CPU or NVMe.
ZeRO Inference focuses on fitting models that are far larger than GPU memory: weights live in CPU or NVMe memory and are streamed to the GPU layer by layer during the forward pass, trading per-token latency for the ability to run the model at all.
In the ZeRO Inference OPT 30B example on a single V100 32GB:
These numbers are small compared to GPU resident LLM runtimes on A100 or H100, but they apply to a model that does not fit natively in 32GB.
A recent I/O characterization of DeepSpeed and FlexGen confirms that offload based systems are dominated by small 128 KiB reads and that I/O behavior becomes the main bottleneck.
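A rough estimate shows why: at batch size 1, every decode step has to move the offloaded weights over the storage link. The model size and bandwidth below are assumptions, not figures from the cited study.

```python
# Why offloaded decoding becomes I/O bound (illustrative numbers only).
weight_bytes = 60e9            # assumption: roughly 60 GB of FP16 weights for a 30B model
nvme_bytes_per_s = 3e9         # assumption: sustained NVMe read bandwidth

io_time_per_token = weight_bytes / nvme_bytes_per_s
print(f"~{io_time_per_token:.0f} s of weight I/O per decode step at batch size 1")
print(f"upper bound ~{1 / io_time_per_token:.2f} tokens/s per sequence")
# Offload runtimes recover throughput by decoding large batches, so each streamed
# weight read is amortized across many sequences; per-request latency stays poor.
```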
Where it fits
The main tradeoffs, summarized qualitatively:
vLLM: PagedAttention with block-based KV and FP8 KV quantization; strong general-purpose default.
TensorRT-LLM: compiled, fused kernels plus KV reuse and offload; peak performance on NVIDIA hardware, with more tuning effort.
TGI: production-grade serving stack; v3 long-context pipeline with prefix caching.
LMDeploy: TurboMind engine with persistent batching and blocked KV; claimed up to 1.8× vLLM request throughput.
SGLang: RadixAttention prefix tree; very high KV hit rates on prefix-heavy, agentic workloads.
DeepSpeed ZeRO Inference: weight offload to CPU/NVMe; runs models that do not fit in GPU memory, at much lower throughput.
For a production system, the choice tends to collapse to a few simple patterns: start with vLLM as the general-purpose default; move to TensorRT-LLM when you control the NVIDIA hardware and need peak performance; pick TGI if you are already on the Hugging Face serving stack or care most about long chat sessions; pick LMDeploy when raw throughput on supported models is the priority; pick SGLang for agentic or heavily prefix-shared workloads; and fall back to DeepSpeed ZeRO Inference when the model simply does not fit on the GPUs you have.
Overall, all these engines are converging on the same idea: KV cache is the real bottleneck resource. The winners are the runtimes that treat KV as a first class data structure to be paged, quantized, reused and offloaded, not just a big tensor slapped into GPU memory.
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
