AI Summary of Peer-Reviewed Research
This page presents an AI-generated summary of a published research paper. The original authors did not write or review this article. See the full disclosure below.
Publication Signals show what we were able to verify about where this research was published. Rating: MODERATE. Core publication signals for this source were verified. Publication Signals reflect the source's verifiable credentials, not the quality of the research.
- ✔ Peer-reviewed source
- ✔ Published in indexed journal
- ✔ No retraction or integrity flags
Overview
Distributed large model inference requires balancing latency and throughput objectives. Tensor parallelism reduces latency through intensive inter-GPU communication, while pipeline parallelism increases throughput at the cost of per-request latency. Liger+ proposes interleaved parallelism to dynamically balance these competing metrics by interleaving computation and communication across multiple requests on multi-GPU architectures.
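To make the trade-off concrete, here is a toy cost model; the formulas and numbers below are illustrative assumptions for intuition only, not measurements or equations from the paper:

```python
# Toy cost model contrasting tensor parallelism (TP) and pipeline
# parallelism (PP) on N GPUs. All numbers and formulas are illustrative
# assumptions, not results from the paper.

def tensor_parallel(compute_ms, comm_ms, n_gpus):
    """TP shards each layer across GPUs: per-request compute shrinks ~1/N,
    but every request pays an extra all-reduce communication cost."""
    latency = compute_ms / n_gpus + comm_ms        # per-request latency (ms)
    throughput = 1000.0 / latency                  # requests/sec, one at a time
    return latency, throughput

def pipeline_parallel(compute_ms, n_gpus, n_requests):
    """PP splits layers into stages; one request still traverses all stages
    (full compute time), but stages overlap across requests."""
    stage_ms = compute_ms / n_gpus
    latency = compute_ms                           # one request crosses the pipe
    total_ms = compute_ms + (n_requests - 1) * stage_ms  # pipelined batch
    throughput = 1000.0 * n_requests / total_ms
    return latency, throughput

tp_lat, tp_tput = tensor_parallel(compute_ms=40.0, comm_ms=8.0, n_gpus=4)
pp_lat, pp_tput = pipeline_parallel(compute_ms=40.0, n_gpus=4, n_requests=32)
# TP yields the lower per-request latency; PP the higher steady-state throughput.
```

Under this toy model, neither strategy wins on both metrics at once, which is the gap interleaved parallelism targets.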
Methods and approach
Liger+ comprises two primary components: a task-aware batch management module and a distributed runtime module. The batch management layer organizes requests according to task characteristics, distinguishing between discriminative and generative workloads. The runtime module schedules computation and communication kernels across multiple GPU streams through three mechanisms:
- precise kernel execution control via CPU-GPU and inter-stream synchronization;
- fine-grained resource mapping that anticipates contention between overlapping kernels;
- decomposition of kernels into smaller units to maximize overlap between one request's computation and another's communication.
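The interleaving idea behind these mechanisms can be sketched with a minimal event-driven simulation; all names, chunk sizes, and the greedy policy here are hypothetical illustrations, not the Liger+ implementation:

```python
# Sketch: decompose each request's kernels into small compute/comm chunks,
# run compute chunks on one stream and communication chunks on another,
# and greedily overlap one request's compute with another's communication.
from dataclasses import dataclass

@dataclass
class Chunk:
    req: str    # request id
    kind: str   # "compute" or "comm" -- which stream it occupies
    ms: float   # duration

def decompose(req, compute_ms, comm_ms, parts):
    """Split a request's work into alternating compute/comm chunks,
    giving the scheduler fine-grained overlap opportunities."""
    chunks = []
    for _ in range(parts):
        chunks.append(Chunk(req, "compute", compute_ms / parts))
        chunks.append(Chunk(req, "comm", comm_ms / parts))
    return chunks

def simulate(requests):
    """Greedy two-stream schedule. Chunks within a request run in order;
    each stream runs one chunk at a time. Returns the makespan in ms."""
    stream_free = {"compute": 0.0, "comm": 0.0}
    req_free = [0.0] * len(requests)   # when each request's last chunk ends
    nxt = [0] * len(requests)          # next chunk index per request
    remaining = sum(len(r) for r in requests)
    while remaining:
        # pick the ready chunk that can start earliest
        best = None
        for i, r in enumerate(requests):
            if nxt[i] == len(r):
                continue
            c = r[nxt[i]]
            start = max(stream_free[c.kind], req_free[i])
            if best is None or start < best[0]:
                best = (start, i, c)
        start, i, c = best
        finish = start + c.ms
        stream_free[c.kind] = finish
        req_free[i] = finish
        nxt[i] += 1
        remaining -= 1
    return max(req_free)

reqs = [decompose("A", 10.0, 10.0, 4), decompose("B", 10.0, 10.0, 4)]
serial = sum(c.ms for r in reqs for c in r)   # no overlap: 40 ms of work
overlapped = simulate(reqs)                   # interleaved schedule: < 40 ms
```

With finer decomposition the makespan approaches the busier stream's total work, which is the intuition behind splitting kernels into smaller units.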
Key Findings
On a 4-device discriminative task, Liger+ achieves a 43.8% reduction in P90 latency while maintaining throughput parity with pipeline parallelism, and a 1.53× throughput improvement, with improved P90 latency, relative to tensor parallelism. For generative tasks on 4 devices, the system demonstrates a 1.15× average throughput improvement and a 26.2% P90 latency reduction compared to tensor parallelism. The dynamic scheduling strategy thus improves latency and throughput simultaneously in most evaluated scenarios.
Implications
Interleaved parallelism addresses a fundamental constraint in distributed inference systems where static parallelism strategy selection forces fixed trade-offs between latency and throughput. By enabling dynamic adaptation across requests, Liger+ expands the operational feasible region for inference deployments that must satisfy heterogeneous performance constraints. The fine-grained kernel scheduling and resource contention modeling represent transferable techniques for optimizing multi-GPU execution.
Disclosure
- Research title: Dynamic Latency-Throughput Balancing in Distributed Large Model Inference with Interleaved Parallelism
- Authors: Jinhui Wei, Shenggan Cheng, Wei Zhu, Jiazhi Jiang, Dan Huang, Zhiguang Chen, Xu Chen, Yutong Lu
- Publication date: 2026-02-23
- DOI: https://doi.org/10.1145/3797040
- OpenAlex record: View
- Image credit: Photo by ElasticComputeFarm on Pixabay
- Disclosure: This post was generated by Claude (Anthropic). The original authors did not write or review this post.