Latency–Throughput Tradeoffs of ONNX Runtime, TensorRT-LLM, vLLM, and Triton: An Empirical Comparison on 1B–3B Parameter LLM Inference

Chuankai Luo; Xu Wang

doi:10.66372/JGER.v4i1.12

Authors

Chuankai Luo, Department of Electronic Engineering, Tsinghua University, Beijing, China
Xu Wang, Computer Science, Beijing University of Posts and Telecommunications, Beijing, China

DOI:

https://doi.org/10.66372/JGER.v4i1.12

Keywords:

LLM inference, inference engines, latency–throughput tradeoff, GPU serving

Abstract

Small-scale large language models with one to three billion parameters have emerged as a practical choice for latency-sensitive deployment, offering a balance between linguistic capability and operational cost. Serving these models efficiently depends heavily on the choice of inference engine, yet published comparisons often target 7B-scale or larger workloads and report metrics in isolation. This paper presents an empirical comparison of four widely adopted inference engines—ONNX Runtime, TensorRT-LLM, vLLM, and NVIDIA Triton configured with the TensorRT-LLM backend—on three open-weight checkpoints (TinyLlama-1.1B, Llama-3.2-1B, and Llama-3.2-3B). Using prompt distributions drawn from ShareGPT, LMSYS-Chat-1M, and HumanEval, we measure time-to-first-token, inter-token latency, sustained throughput across batch sizes from 1 to 64, GPU utilization, and peak memory footprint on an NVIDIA A100 40GB and an RTX 4090 24GB. Results indicate that TensorRT-LLM attains the lowest single-request latency (22.6 ms time-to-first-token on the 1.1B checkpoint), while vLLM delivers comparable throughput at large batch sizes with a 13.8% smaller memory footprint at the 3B scale due to its paged KV-cache layout. ONNX Runtime trails on both dimensions yet remains competitive for cross-platform deployment. The findings yield quantitative deployment guidance for cost-aware serving of small-scale language models.
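As a reading aid for the latency and throughput metrics named in the abstract, the following is a minimal sketch of how they are commonly derived from per-token arrival timestamps. It is not the authors' measurement harness. The `RequestTrace` record and all function names are hypothetical, and the definitions (mean gap for inter-token latency, wall-clock span for throughput) are assumptions that the paper itself may define differently.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one generation request (hypothetical format)."""
    sent_at: float            # when the request was issued
    token_times: List[float]  # arrival time of each generated token, in order


def time_to_first_token(trace: RequestTrace) -> float:
    """Delay between issuing the request and receiving the first token."""
    return trace.token_times[0] - trace.sent_at


def inter_token_latency(trace: RequestTrace) -> float:
    """Mean gap between consecutive tokens; 0.0 if fewer than two tokens."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return mean(gaps) if gaps else 0.0


def throughput(traces: List[RequestTrace]) -> float:
    """Generated tokens per second over the wall-clock span of all requests."""
    total_tokens = sum(len(t.token_times) for t in traces)
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return total_tokens / (end - start)


if __name__ == "__main__":
    # Illustrative timestamps only, not measurements from the paper.
    example = RequestTrace(sent_at=0.0, token_times=[0.5, 1.0, 1.5])
    print(time_to_first_token(example))
    print(inter_token_latency(example))
    print(throughput([example]))
```

Under these definitions, time-to-first-token mainly reflects prefill cost, inter-token latency reflects per-step decode cost, and throughput aggregates over concurrent requests, which is why the three can rank engines differently.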

Author Biography

Xu Wang, Computer Science, Beijing University of Posts and Telecommunications, Beijing, China
