FIPO:用未来-KL影响的政策优化来启发深刻理性
FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

期刊
Journal arXiv

年份
Year 2026

分类
Category 机器学习
Machine Learning

国家
Country 德国Germany

🔗 访问原文
🔗 Access Paper

📝 摘要
Abstract

We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0\%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that establishing dense advantage formulations is a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.

📊 文章统计
Article Statistics

基础数据
Basic Stats

285 浏览
Views

0 下载
Downloads

9 引用
Citations

引用趋势
Citation Trend

阅读国家分布
Country Distribution

阅读机构分布
Institution Distribution

月度浏览趋势
Monthly Views

影响因子分析
Impact Analysis

6.40 综合评分
Overall Score

引用影响力
Citation Impact

浏览热度
View Popularity

下载频次
Download Frequency

FIPO:用未来-KL影响的政策优化来启发深刻理性
FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

📝 摘要
Abstract

📊 文章统计
Article Statistics

基础数据
Basic Stats

引用趋势
Citation Trend

阅读国家分布
Country Distribution

阅读机构分布
Institution Distribution

月度浏览趋势
Monthly Views

相关关键词
Related Keywords

影响因子分析
Impact Analysis

📄 相关文章
Related Articles

FIPO:用未来-KL影响的政策优化来启发深刻理性FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

📝 摘要Abstract

📊 文章统计Article Statistics

基础数据Basic Stats

引用趋势Citation Trend

阅读国家分布Country Distribution

阅读机构分布Institution Distribution

月度浏览趋势Monthly Views

相关关键词Related Keywords

影响因子分析Impact Analysis

📄 相关文章Related Articles

海洋智能分析Ocean AI Analysis

FIPO:用未来-KL影响的政策优化来启发深刻理性
FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

📝 摘要
Abstract

📊 文章统计
Article Statistics

基础数据
Basic Stats

引用趋势
Citation Trend

阅读国家分布
Country Distribution

阅读机构分布
Institution Distribution

月度浏览趋势
Monthly Views

相关关键词
Related Keywords

影响因子分析
Impact Analysis

📄 相关文章
Related Articles