AI Summary of Peer-Reviewed Research
This page presents an AI-generated summary of a published research paper. The original authors did not write or review this article. See the full disclosure below.
Publication Signals show what we were able to verify about where this research was published. Rating: MODERATE. Core publication signals for this source were verified. Publication Signals reflect the source's verifiable credentials, not the quality of the research.
- ✔ Peer-reviewed source
- ✔ Published in indexed journal
- ✔ No retraction or integrity flags
Overview
Distributed large model inference requires balancing latency and throughput objectives. Tensor parallelism reduces latency through intensive inter-GPU communication, while pipeline parallelism increases throughput at the cost of per-request latency. Liger+ proposes interleaved parallelism to dynamically balance these competing metrics by interleaving computation and communication across multiple requests on multi-GPU architectures.
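To make the trade-off concrete, here is a toy cost model; the formulas and numbers below are illustrative assumptions for intuition only, not measurements or equations from the paper:

```python
# Toy cost model contrasting tensor parallelism (TP) and pipeline
# parallelism (PP) on N GPUs. All numbers and formulas are illustrative
# assumptions, not results from the paper.

def tensor_parallel(compute_ms, comm_ms, n_gpus):
    """TP shards each layer across GPUs: per-request compute shrinks ~1/N,
    but every request pays an extra all-reduce communication cost."""
    latency = compute_ms / n_gpus + comm_ms        # per-request latency (ms)
    throughput = 1000.0 / latency                  # requests/sec, one at a time
    return latency, throughput

def pipeline_parallel(compute_ms, n_gpus, n_requests):
    """PP splits layers into stages; one request still traverses all stages
    (full compute time), but stages overlap across requests."""
    stage_ms = compute_ms / n_gpus
    latency = compute_ms                           # one request crosses the pipe
    total_ms = compute_ms + (n_requests - 1) * stage_ms  # pipelined batch
    throughput = 1000.0 * n_requests / total_ms
    return latency, throughput

tp_lat, tp_tput = tensor_parallel(compute_ms=40.0, comm_ms=8.0, n_gpus=4)
pp_lat, pp_tput = pipeline_parallel(compute_ms=40.0, n_gpus=4, n_requests=32)
# TP yields the lower per-request latency; PP the higher steady-state throughput.
```

Under this toy model, neither strategy wins on both metrics at once, which is the gap interleaved parallelism targets.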
Methods and approach
Liger+ comprises two primary components: a task-aware batch management module and a distributed runtime module. The batch management layer organizes requests according to task characteristics, distinguishing between discriminative and generative workloads. The runtime module schedules computation and communication kernels across multiple GPU streams through three mechanisms:
- precise kernel execution control via CPU-GPU and inter-stream synchronization;
- fine-grained resource mapping that anticipates contention between overlapping kernels;
- decomposition of kernels into smaller units to maximize overlap between one request's computation and another's communication.
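The interleaving idea behind these mechanisms can be sketched with a minimal event-driven simulation; all names, chunk sizes, and the greedy policy here are hypothetical illustrations, not the Liger+ implementation:

```python
# Sketch: decompose each request's kernels into small compute/comm chunks,
# run compute chunks on one stream and communication chunks on another,
# and greedily overlap one request's compute with another's communication.
from dataclasses import dataclass

@dataclass
class Chunk:
    req: str    # request id
    kind: str   # "compute" or "comm" -- which stream it occupies
    ms: float   # duration

def decompose(req, compute_ms, comm_ms, parts):
    """Split a request's work into alternating compute/comm chunks,
    giving the scheduler fine-grained overlap opportunities."""
    chunks = []
    for _ in range(parts):
        chunks.append(Chunk(req, "compute", compute_ms / parts))
        chunks.append(Chunk(req, "comm", comm_ms / parts))
    return chunks

def simulate(requests):
    """Greedy two-stream schedule. Chunks within a request run in order;
    each stream runs one chunk at a time. Returns the makespan in ms."""
    stream_free = {"compute": 0.0, "comm": 0.0}
    req_free = [0.0] * len(requests)   # when each request's last chunk ends
    nxt = [0] * len(requests)          # next chunk index per request
    remaining = sum(len(r) for r in requests)
    while remaining:
        # pick the ready chunk that can start earliest
        best = None
        for i, r in enumerate(requests):
            if nxt[i] == len(r):
                continue
            c = r[nxt[i]]
            start = max(stream_free[c.kind], req_free[i])
            if best is None or start < best[0]:
                best = (start, i, c)
        start, i, c = best
        finish = start + c.ms
        stream_free[c.kind] = finish
        req_free[i] = finish
        nxt[i] += 1
        remaining -= 1
    return max(req_free)

reqs = [decompose("A", 10.0, 10.0, 4), decompose("B", 10.0, 10.0, 4)]
serial = sum(c.ms for r in reqs for c in r)   # no overlap: 40 ms of work
overlapped = simulate(reqs)                   # interleaved schedule: < 40 ms
```

With finer decomposition the makespan approaches the busier stream's total work, which is the intuition behind splitting kernels into smaller units.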
Key Findings
On a 4-device discriminative task, Liger+ achieves a 43.8% reduction in P90 latency while maintaining throughput parity with pipeline parallelism, and a 1.53× throughput improvement, with improved P90 latency, relative to tensor parallelism. For generative tasks on 4 devices, the system demonstrates a 1.15× average throughput improvement and a 26.2% P90 latency reduction compared to tensor parallelism. The dynamic scheduling strategy thus improves latency and throughput simultaneously in most evaluated scenarios.
Implications
Interleaved parallelism addresses a fundamental constraint in distributed inference systems where static parallelism strategy selection forces fixed trade-offs between latency and throughput. By enabling dynamic adaptation across requests, Liger+ expands the operational feasible region for inference deployments that must satisfy heterogeneous performance constraints. The fine-grained kernel scheduling and resource contention modeling represent transferable techniques for optimizing multi-GPU execution.
Disclosure
- Research title: Dynamic Latency-Throughput Balancing in Distributed Large Model Inference with Interleaved Parallelism
- Authors: Jinhui Wei, Shenggan Cheng, Wei Zhu, Jiazhi Jiang, Dan Huang, Zhiguang Chen, Xu Chen, Yutong Lu
- Publication date: 2026-02-23
- DOI: https://doi.org/10.1145/3797040
- OpenAlex record: View
- Image credit: Photo by ElasticComputeFarm on Pixabay
- Disclosure: This post was generated by Claude (Anthropic). The original authors did not write or review this post.