One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining

作者
Authors Philip Zmushko | Egor Petrov | Nursultan Abdullaev | Mikhail Khrushchev | Samuel Horváth

期刊
Journal arXiv

年份
Year 2026

分类
Category 数据分析
Data Analysis

国家
Country -

🔗 访问原文
🔗 Access Paper

📝 摘要
Abstract

Modern large-scale LLM pretraining benefits from utilizing Pipeline Parallelism; however, synchronous implementations leave GPUs idle during pipeline bubbles, wasting computational resources. Asynchronous Pipeline Parallelism eliminates these bubbles, maximizing throughput at the cost of gradient staleness. Among asynchronous schedules, PipeDream-2BW is particularly appealing: unlike the original PipeDream schedule, it ensures a constant one-step gradient delay regardless of pipeline depth. However, its adoption remains limited due to the common belief that optimizing under staleness is fundamentally unstable. In this work, we challenge this assumption, demonstrating that degradation under one-step delay depends strongly on optimizer choice rather than being an intrinsic limitation. We provide the first comprehensive empirical analysis showing that while AdamW, the predominant optimizer at the time when PipeDream-2BW was introduced, indeed suffers from severe degradation, recent methods like Muon exhibit strong robustness under a one-step delay. We introduce an optimizer-agnostic Error Feedback-inspired correction to further mitigate delay effects. We provide supporting theoretical analysis demonstrating convergence for Muon with and without this correction. Extensive evaluation on models up to 10B parameters confirms that our strategies bridge the performance gap with synchronous training, highlighting the practical potential of asynchronous pipeline parallelism at scale.

📊 文章统计
Article Statistics

基础数据
Basic Stats

74 浏览
Views

0 下载
Downloads

10 引用
Citations

引用趋势
Citation Trend

阅读国家分布
Country Distribution

阅读机构分布
Institution Distribution

月度浏览趋势
Monthly Views

影响因子分析
Impact Analysis

7.50 综合评分
Overall Score

引用影响力
Citation Impact

浏览热度
View Popularity

下载频次
Download Frequency

One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining

📝 摘要
Abstract

📊 文章统计
Article Statistics

基础数据
Basic Stats

引用趋势
Citation Trend

阅读国家分布
Country Distribution

阅读机构分布
Institution Distribution

月度浏览趋势
Monthly Views

相关关键词
Related Keywords

影响因子分析
Impact Analysis

📄 相关文章
Related Articles

One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining

📝 摘要Abstract

📊 文章统计Article Statistics

基础数据Basic Stats

引用趋势Citation Trend

阅读国家分布Country Distribution

阅读机构分布Institution Distribution

月度浏览趋势Monthly Views

相关关键词Related Keywords

影响因子分析Impact Analysis

📄 相关文章Related Articles

海洋智能分析Ocean AI Analysis

📝 摘要
Abstract

📊 文章统计
Article Statistics

基础数据
Basic Stats

引用趋势
Citation Trend

阅读国家分布
Country Distribution

阅读机构分布
Institution Distribution

月度浏览趋势
Monthly Views

相关关键词
Related Keywords

影响因子分析
Impact Analysis

📄 相关文章
Related Articles