What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

Authors: Stephen Cheng | Sarah Wiegreffe | Dinesh Manocha

Journal: arXiv

Year: 2026

Category: Data Analysis



Abstract

Applying steering vectors to large language models (LLMs) is an efficient and effective model alignment technique, but we lack an interpretable explanation for how it works: specifically, what internal mechanisms steering vectors affect and how this results in different model outputs. To investigate the causal mechanisms underlying the effectiveness of steering vectors, we conduct a comprehensive case study on refusal. We propose a multi-token activation patching framework and discover that different steering methodologies leverage functionally interchangeable circuits when applied at the same layer. These circuits reveal that steering vectors primarily interact with the attention mechanism through the OV circuit while largely ignoring the QK circuit: freezing all attention scores during steering drops performance by only 8.75% across two model families. A mathematical decomposition of the steered OV circuit further reveals semantically interpretable concepts, even in cases where the steering vector itself does not. Leveraging the activation patching results, we show that steering vectors can be sparsified by up to 90-99% while retaining most performance, and that different steering methodologies agree on a subset of important dimensions.
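
To make the sparsification claim concrete, here is a minimal toy sketch of building a steering vector, zeroing most of its dimensions, and adding it to a hidden state. It is not the authors' method: the difference-of-means construction, the use of absolute magnitude as the importance score (the abstract instead derives importance from activation patching results), the random toy activations, and every function name below are assumptions made for illustration only.

```python
import random

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_of_means(pos_acts, neg_acts):
    """Steering vector as mean(pos) - mean(neg). One common construction, assumed here."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def sparsify(vector, importance, keep_fraction):
    """Zero every dimension except the top keep_fraction ranked by importance."""
    k = max(1, round(len(vector) * keep_fraction))
    ranked = sorted(range(len(vector)), key=lambda i: importance[i], reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]

def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to a single hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Toy data (hypothetical): 100-dim activations where only the first 5
# dimensions separate the two prompt classes.
rng = random.Random(0)
dim, n_samples, n_signal = 100, 50, 5

def sample(shift):
    return [rng.gauss(0.0, 1.0) + (shift if i < n_signal else 0.0) for i in range(dim)]

pos_acts = [sample(3.0) for _ in range(n_samples)]
neg_acts = [sample(0.0) for _ in range(n_samples)]

steering_vec = difference_of_means(pos_acts, neg_acts)

# Stand-in importance score: absolute magnitude. The paper's score comes
# from activation patching, which this sketch does not reproduce.
importance = [abs(v) for v in steering_vec]
sparse_vec = sparsify(steering_vec, importance, keep_fraction=0.10)  # 90% sparsity

hidden = [0.0] * dim
steered = steer(hidden, sparse_vec, alpha=1.0)
```

In this toy setup the dimensions that carry the class difference survive the 90% cut, which is the intuition behind retaining most of the effect with a much sparser vector. Whether that holds for a real model depends on how the importance scores are obtained.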

Article Statistics

Basic Stats

89 Views
0 Downloads
2 Citations


Impact Analysis

6.30 Overall Score

