Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?

Authors: no author information available

Journal: no journal information available

Year: 2026

arXiv: http://arxiv.org/abs/2603.30036v1

📝 Abstract

Chain-of-Thought (CoT) monitoring, in which automated systems monitor the CoT of an LLM, is a promising approach for effectively overseeing AI systems. However, the extent to which a model's CoT helps us oversee the model, i.e. the monitorability of the CoT, can be affected by training, for instance by the model learning to hide important features of its reasoning. We propose and empirically validate a conceptual framework for predicting when and why this occurs. We model LLM post-training as an RL environment where the reward decomposes into two terms: one term depending on final outputs and another term depending on the CoT. Our framework allows us to classify these two terms as "aligned", "orthogonal", or "in-conflict" before training. We predict that training with in-conflict terms will reduce monitorability, orthogonal terms will not affect it, and aligned terms will improve it. To validate our framework, we use it to classify a set of RL environments, train LLMs within those environments, and evaluate how training affects CoT monitorability. We find that (1) training with "in-conflict" reward terms reduces CoT monitorability and (2) optimizing in-conflict reward terms is difficult.
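
The reward framing in the abstract can be illustrated with a small sketch. It assumes the two reward terms are simply summed, which the abstract suggests but does not spell out. The names `Relation`, `PREDICTED_EFFECT`, `total_reward`, and both toy reward functions are hypothetical, introduced here only for illustration. The abstract does not describe how a pair of terms is actually classified, so the sketch records the three labels and their predicted effects without attempting the classification itself.

```python
from enum import Enum
from typing import Callable


class Relation(Enum):
    """The three labels the abstract assigns to a pair of reward terms."""
    ALIGNED = "aligned"
    ORTHOGONAL = "orthogonal"
    IN_CONFLICT = "in-conflict"


# Predicted effect of training on CoT monitorability, as stated in the abstract.
PREDICTED_EFFECT = {
    Relation.ALIGNED: "improves",
    Relation.ORTHOGONAL: "unaffected",
    Relation.IN_CONFLICT: "reduces",
}


def total_reward(
    cot: str,
    output: str,
    output_term: Callable[[str], float],
    cot_term: Callable[[str], float],
) -> float:
    """Reward for one episode: a term on the final output plus a term on the CoT.

    Assumption: the two terms are combined by simple addition.
    """
    return output_term(output) + cot_term(cot)


# Hypothetical toy terms, not taken from the paper.
def correct_answer(output: str) -> float:
    """Output term: 1.0 if the final answer is the string "42", else 0.0."""
    return 1.0 if output.strip() == "42" else 0.0


def mentions_forbidden_word(cot: str) -> float:
    """CoT term: -1.0 if the reasoning contains an arbitrarily chosen word."""
    return -1.0 if "shortcut" in cot.lower() else 0.0


if __name__ == "__main__":
    reward = total_reward(
        "I took a shortcut: 6 * 7", "42", correct_answer, mentions_forbidden_word
    )
    print(reward)
    print(PREDICTED_EFFECT[Relation.IN_CONFLICT])
```

Keeping the two terms as separate callables mirrors the decomposition in the abstract: the output term never sees the CoT, and the CoT term never sees the final output, so each can be inspected on its own before any training is run.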

📊 Article Statistics

Basic Stats

Views: 9

Downloads: 0

Citations: 0

Impact Analysis

Overall Score: 6.00
