AI Summary of Peer-Reviewed Research

This page presents an AI-generated summary of a published research paper. The original authors did not write or review this article. [See full disclosure ↓]

Publishing process signals: MODERATE (reflects the venue and review process).

Liger+ dynamically balances latency and throughput in large model inference

Research areas: Distributed Computing, Parallel Computing and Optimization Techniques, Inference

What the study found

Liger+ is a distributed large model inference system that can dynamically balance latency and throughput on multi-GPU systems. The study says it does this through a new interleaved parallelism approach that mixes computation and communication across requests.
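To make the idea of mixing computation and communication concrete, here is a minimal back-of-the-envelope sketch, not the paper's implementation: if one micro-batch's communication can run while the next micro-batch computes, only the final transfer is exposed in the end-to-end time. All timings and function names below are hypothetical.

```python
# Illustrative model of overlapping compute and communication across
# micro-batches. Numbers are made up; the real system's behavior
# depends on hardware and model shape.

COMPUTE_MS = 4.0   # per-micro-batch compute time (hypothetical)
COMM_MS = 2.0      # per-micro-batch communication time (hypothetical)

def serial_time(n_microbatches: int) -> float:
    """No overlap: each micro-batch computes, then communicates."""
    return n_microbatches * (COMPUTE_MS + COMM_MS)

def interleaved_time(n_microbatches: int) -> float:
    """Overlap micro-batch i's communication with micro-batch i+1's
    compute (valid when COMM_MS <= COMPUTE_MS); only the last
    communication is left exposed."""
    return n_microbatches * COMPUTE_MS + COMM_MS

if __name__ == "__main__":
    n = 8
    print(f"serial:      {serial_time(n):.1f} ms")
    print(f"interleaved: {interleaved_time(n):.1f} ms")
```

With these toy numbers, interleaving hides most of the communication cost, which is the intuition the summary attributes to the interleaved parallelism approach.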

Why the authors say this matters

The authors say distributed large model inference is difficult to optimize because the two standard strategies sit at fixed points on a tradeoff: tensor parallelism, which splits each layer's work across GPUs, lowers per-request latency but adds communication cost, while pipeline parallelism raises throughput without reducing each request's latency. The study suggests Liger+ addresses this fixed tradeoff by allowing the balance between the two goals to be adjusted dynamically.
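The tradeoff can be sketched with a hypothetical cost model (the formulas and numbers here are illustrative, not taken from the paper): tensor parallelism divides each request's work but pays a per-layer communication cost, while pipeline parallelism leaves per-request latency unchanged and instead overlaps multiple requests across stages.

```python
# Toy latency/throughput model of the tensor- vs. pipeline-parallelism
# tradeoff. All quantities are hypothetical.

def tensor_parallel(work_ms: float, n_gpus: int, comm_ms: float):
    """TP: per-request work shrinks ~1/n_gpus, but every request
    pays an added communication cost."""
    latency = work_ms / n_gpus + comm_ms
    throughput = 1000.0 / latency  # requests/sec, one request at a time
    return latency, throughput

def pipeline_parallel(work_ms: float, n_gpus: int):
    """PP: per-request latency stays near the single-GPU value, but
    n_gpus requests are in flight at once, so throughput scales."""
    latency = work_ms
    throughput = 1000.0 / (work_ms / n_gpus)
    return latency, throughput

if __name__ == "__main__":
    print(tensor_parallel(40.0, 4, 6.0))   # lower latency
    print(pipeline_parallel(40.0, 4))      # higher throughput
```

In this sketch, TP wins on latency (16 ms vs. 40 ms) and PP wins on throughput (100 vs. 62.5 req/s), which is the fixed tradeoff the summary says Liger+ tries to escape.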

What the researchers tested

The researchers developed Liger+, including task-aware batch management and distributed runtime modules. They evaluated it on multiple models and devices, using both discriminative and generative tasks on a 4-device setup, and compared it with fixed parallelism strategies.
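The summary does not describe how the task-aware batch manager decides between strategies, so the following is only one plausible sketch under an assumed policy: switch strategy based on current load and a latency target. Every name and threshold here is hypothetical.

```python
# Hypothetical task-aware dispatch policy. The actual Liger+ batch
# manager is not described in this summary; this only illustrates the
# kind of decision such a component might make.

def choose_strategy(queue_depth: int,
                    latency_slo_ms: float,
                    est_tp_latency_ms: float,
                    high_load: int = 16) -> str:
    """Pick a parallelism strategy for the next batch."""
    if queue_depth >= high_load:
        return "pipeline"     # many waiting requests: favor throughput
    if est_tp_latency_ms <= latency_slo_ms:
        return "tensor"       # light load: spend GPUs on latency
    return "interleaved"      # otherwise mix compute and communication
```

For example, a deep queue would select `"pipeline"`, while a nearly idle system with slack in its latency target would select `"tensor"`.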

What worked and what didn't

The evaluations show that Liger+ can, in most cases, accommodate higher throughput demand while also achieving better latency than fixed parallelism strategies. On a 4-device discriminative task, it reduced P90 latency by 43.8% while matching the throughput of pipeline parallelism, and it increased throughput by 1.53× with improved P90 latency compared with tensor parallelism. On a 4-device generative task, it improved average throughput by 1.15× and reduced P90 latency by 26.2% compared with tensor parallelism.

What to keep in mind

The abstract does not describe detailed limitations beyond noting that the system was evaluated across models and devices and that the reported gains held in most cases. The summary available here does not include information about failures, edge cases, or conditions where the method did not perform as well.

Key points

  • Liger+ is described as a distributed large model inference system for multi-GPU systems.
  • The paper says it uses interleaved parallelism to balance latency and throughput dynamically.
  • The system includes task-aware batch management and distributed runtime modules.
  • In a 4-device discriminative task, Liger+ reduced P90 latency by 43.8% versus pipeline parallelism while keeping throughput the same.
  • In a 4-device generative task, Liger+ improved average throughput by 1.15× and reduced P90 latency by 26.2% versus tensor parallelism.

Disclosure

Research title:
Liger+ dynamically balances latency and throughput in large model inference
Authors:
Jinhui Wei, Shenggan Cheng, Wei Zhu, Jiazhi Jiang, Dan Huang, Zhiguang Chen, Jiangsu Du, Yutong Lu
Institutions:
Sun Yat-sen University, National University of Singapore, China Mobile (China)
Publication date:
2026-02-23
AI provenance: This post was generated by OpenAI. The original authors did not write or review this post.