AI Summary of Peer-Reviewed Research

This page presents an AI-generated summary of a published research paper. The original authors did not write or review this article. [See full disclosure ↓]

Publishing process signals: MODERATE (reflects the venue and review process).

Liger+ dynamically balances latency and throughput in large model inference

Research areas: Distributed Computing, Parallel Computing and Optimization Techniques, Inference

What the study found

Liger+ is a distributed large model inference system that can dynamically balance latency and throughput on multi-GPU systems. The study says it does this through a new interleaved parallelism approach that mixes computation and communication across requests.
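To make the idea of mixing computation and communication concrete, here is a minimal back-of-the-envelope sketch, not the paper's implementation: if one micro-batch's communication can run while the next micro-batch computes, only the final transfer is exposed in the end-to-end time. All timings and function names below are hypothetical.

```python
# Illustrative model of overlapping compute and communication across
# micro-batches. Numbers are made up; the real system's behavior
# depends on hardware and model shape.

COMPUTE_MS = 4.0   # per-micro-batch compute time (hypothetical)
COMM_MS = 2.0      # per-micro-batch communication time (hypothetical)

def serial_time(n_microbatches: int) -> float:
    """No overlap: each micro-batch computes, then communicates."""
    return n_microbatches * (COMPUTE_MS + COMM_MS)

def interleaved_time(n_microbatches: int) -> float:
    """Overlap micro-batch i's communication with micro-batch i+1's
    compute (valid when COMM_MS <= COMPUTE_MS); only the last
    communication is left exposed."""
    return n_microbatches * COMPUTE_MS + COMM_MS

if __name__ == "__main__":
    n = 8
    print(f"serial:      {serial_time(n):.1f} ms")
    print(f"interleaved: {interleaved_time(n):.1f} ms")
```

With these toy numbers, interleaving hides most of the communication cost, which is the intuition the summary attributes to the interleaved parallelism approach.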

Why the authors say this matters

The authors say distributed large model inference is difficult to optimize because the two standard strategies sit at fixed points on a tradeoff: tensor parallelism, which splits each layer's work across GPUs, lowers per-request latency but adds communication cost, while pipeline parallelism raises throughput without reducing each request's latency. The study suggests Liger+ addresses this fixed tradeoff by allowing the balance between the two goals to be adjusted dynamically.
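The tradeoff can be sketched with a hypothetical cost model (the formulas and numbers here are illustrative, not taken from the paper): tensor parallelism divides each request's work but pays a per-layer communication cost, while pipeline parallelism leaves per-request latency unchanged and instead overlaps multiple requests across stages.

```python
# Toy latency/throughput model of the tensor- vs. pipeline-parallelism
# tradeoff. All quantities are hypothetical.

def tensor_parallel(work_ms: float, n_gpus: int, comm_ms: float):
    """TP: per-request work shrinks ~1/n_gpus, but every request
    pays an added communication cost."""
    latency = work_ms / n_gpus + comm_ms
    throughput = 1000.0 / latency  # requests/sec, one request at a time
    return latency, throughput

def pipeline_parallel(work_ms: float, n_gpus: int):
    """PP: per-request latency stays near the single-GPU value, but
    n_gpus requests are in flight at once, so throughput scales."""
    latency = work_ms
    throughput = 1000.0 / (work_ms / n_gpus)
    return latency, throughput

if __name__ == "__main__":
    print(tensor_parallel(40.0, 4, 6.0))   # lower latency
    print(pipeline_parallel(40.0, 4))      # higher throughput
```

In this sketch, TP wins on latency (16 ms vs. 40 ms) and PP wins on throughput (100 vs. 62.5 req/s), which is the fixed tradeoff the summary says Liger+ tries to escape.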

What the researchers tested

The researchers developed Liger+, including task-aware batch management and distributed runtime modules. They evaluated it on multiple models and devices, using both discriminative and generative tasks on a 4-device setup, and compared it with fixed parallelism strategies.
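The summary does not describe how the task-aware batch manager decides between strategies, so the following is only one plausible sketch under an assumed policy: switch strategy based on current load and a latency target. Every name and threshold here is hypothetical.

```python
# Hypothetical task-aware dispatch policy. The actual Liger+ batch
# manager is not described in this summary; this only illustrates the
# kind of decision such a component might make.

def choose_strategy(queue_depth: int,
                    latency_slo_ms: float,
                    est_tp_latency_ms: float,
                    high_load: int = 16) -> str:
    """Pick a parallelism strategy for the next batch."""
    if queue_depth >= high_load:
        return "pipeline"     # many waiting requests: favor throughput
    if est_tp_latency_ms <= latency_slo_ms:
        return "tensor"       # light load: spend GPUs on latency
    return "interleaved"      # otherwise mix compute and communication
```

For example, a deep queue would select `"pipeline"`, while a nearly idle system with slack in its latency target would select `"tensor"`.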

What worked and what didn't

The evaluations show that Liger+ can, in most cases, accommodate higher throughput demand while also achieving better latency than fixed parallelism strategies. On a 4-device discriminative task, it reduced P90 latency by 43.8% while matching the throughput of pipeline parallelism, and it increased throughput by 1.53× with improved P90 latency compared with tensor parallelism. On a 4-device generative task, it improved average throughput by 1.15× and reduced P90 latency by 26.2% compared with tensor parallelism.

What to keep in mind

The abstract does not describe detailed limitations beyond noting that the system was evaluated across models and devices and that the reported gains held in most cases. The summary available here does not include information about failures, edge cases, or conditions where the method did not perform as well.

Key points

  • Liger+ is described as a distributed large model inference system for multi-GPU systems.
  • The paper says it uses interleaved parallelism to balance latency and throughput dynamically.
  • The system includes task-aware batch management and distributed runtime modules.
  • In a 4-device discriminative task, Liger+ reduced P90 latency by 43.8% versus pipeline parallelism while keeping throughput the same.
  • In a 4-device generative task, Liger+ improved average throughput by 1.15× and reduced P90 latency by 26.2% versus tensor parallelism.

Disclosure

Research title:
Liger+ dynamically balances latency and throughput in large model inference
Authors:
Jinhui Wei, Shenggan Cheng, Wei Zhu, Jiazhi Jiang, Dan Huang, Zhiguang Chen, Jiangsu Du, Yutong Lu
Institutions:
Sun Yat-sen University, National University of Singapore, China Mobile (China)
Publication date:
2026-02-23
AI provenance: This post was generated by OpenAI. The original authors did not write or review this post.