What the study found
Liger+ is a distributed large model inference system that can dynamically balance latency and throughput on multi-GPU systems. The study says it does this through a new interleaved parallelism approach that overlaps computation and communication across requests.
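The paper's abstract does not spell out how the interleaving works; as a rough intuition only, overlapping one request's communication with another request's computation can be sketched with threads and timed stand-ins (the function names and timings here are illustrative assumptions, not Liger+'s actual runtime):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def compute(req, t=0.05):
    # stand-in for a GPU compute phase of one request
    time.sleep(t)

def communicate(req, t=0.05):
    # stand-in for an inter-GPU transfer (e.g. an all-reduce)
    time.sleep(t)

def sequential(reqs):
    # each request finishes compute, then communication, before the next starts
    start = time.perf_counter()
    for r in reqs:
        compute(r)
        communicate(r)
    return time.perf_counter() - start

def interleaved(reqs):
    # while request i's communication is in flight,
    # request i+1's compute proceeds on the main worker
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for r in reqs:
            compute(r)
            if pending is not None:
                pending.result()
            pending = pool.submit(communicate, r)
        pending.result()
    return time.perf_counter() - start
```

With four requests, the sequential schedule pays compute + communication per request, while the interleaved schedule hides all but the final communication behind compute, so total time drops noticeably.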
Why the authors say this matters
The authors say distributed large model inference is difficult to optimize because the two standard strategies sit at fixed points on the latency-throughput tradeoff: tensor parallelism, which splits each layer's work across GPUs, lowers per-request latency but adds communication cost at every layer, while pipeline parallelism raises overall throughput but does not reduce any single request's latency. The study suggests Liger+ addresses this fixed tradeoff by balancing the two goals dynamically.
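A toy cost model makes the tradeoff concrete (this is an illustrative sketch with assumed numbers, not the paper's analysis): tensor parallelism divides each layer's compute across GPUs but pays a communication cost per layer, while pipeline parallelism leaves per-request time unchanged and instead keeps one request in flight per pipeline stage.

```python
def tensor_parallel(compute_time, comm_per_layer, layers, gpus):
    # compute shrinks ~1/gpus, but every layer adds communication
    latency = compute_time / gpus + comm_per_layer * layers
    throughput = 1.0 / latency  # one request at a time
    return latency, throughput

def pipeline_parallel(compute_time, gpus):
    # per-request time is unchanged, but 'gpus' stages run concurrently
    latency = compute_time
    throughput = gpus / compute_time
    return latency, throughput

# Assumed workload: 1.0 s of single-GPU compute, 24 layers,
# 0.01 s of communication per layer, 4 GPUs.
tp_lat, tp_tput = tensor_parallel(1.0, 0.01, 24, 4)
pp_lat, pp_tput = pipeline_parallel(1.0, 4)
```

Under these assumed numbers, tensor parallelism gives the lower latency (0.49 s vs 1.0 s) while pipeline parallelism gives the higher throughput (4.0 vs about 2.0 requests/s), which is the fixed tradeoff the paper says Liger+ navigates dynamically.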
What the researchers tested
The researchers developed Liger+, including task-aware batch management and distributed runtime modules. They evaluated it on multiple models and devices, using both discriminative and generative tasks on a 4-device setup, and compared it with fixed parallelism strategies.
What worked and what didn't
The evaluations show that, in most cases, Liger+ sustains higher throughput demand while also achieving lower latency than fixed parallelism strategies. On a 4-device discriminative task, it reduced P90 latency by 43.8% while matching the throughput of pipeline parallelism, and it increased throughput by 1.53× with improved P90 latency compared with tensor parallelism. On a 4-device generative task, it improved average throughput by 1.15× and reduced P90 latency by 26.2% compared with tensor parallelism.
What to keep in mind
The abstract does not describe detailed limitations beyond noting that the system was evaluated across models and devices and that the reported gains held in most cases. The summary available here does not include information about failures, edge cases, or conditions under which the method performed less well.
Key points
- Liger+ is described as a distributed large model inference system for multi-GPU systems.
- The paper says it uses interleaved parallelism to balance latency and throughput dynamically.
- The system includes task-aware batch management and distributed runtime modules.
- In a 4-device discriminative task, Liger+ reduced P90 latency by 43.8% versus pipeline parallelism while keeping throughput the same.
- In a 4-device generative task, Liger+ improved average throughput by 1.15× and reduced P90 latency by 26.2% versus tensor parallelism.
Disclosure
- Research title:
- Liger+ dynamically balances latency and throughput in large model inference
- Authors:
- Jinhui Wei, Shenggan Cheng, Wei Zhu, Jiazhi Jiang, Dan Huang, Zhiguang Chen, Jiangsu Du, Yutong Lu
- Institutions:
- Sun Yat-sen University, National University of Singapore, China Mobile (China)
- Publication date:
- 2026-02-23
- DOI:
- 10.1145/3797040