When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability
作者
Authors
ML Nissen Gonzalez | Melwina Albuquerque | Laurence Wroe | Jacob Meyer Cohen | Logan Riggs Smith | Thomas Dooms
期刊
Journal
暂无期刊信息
年份
Year
2026
分类
Category
国家
Country
-
📝 摘要
Abstract
Mechanistic interpretability aims to break models into meaningful parts; verifying that two such parts implement the same computation is a prerequisite. Existing similarity measures evaluate either empirical behaviour, leaving them blind to out-of-distribution mechanisms, or basis-dependent parameters, meaning they disregard weight-space symmetries. To address these issues for the class of tensor-based models, we introduce a weight-based metric, tensor similarity, that is invariant to such symmetries. This metric captures global functional equivalence and accounts for cross-layer mechanisms using an efficient recursive algorithm. Empirically, tensor similarity tracks functional training dynamics, such as grokking and backdoor insertion, with higher fidelity than existing metrics. This reduces measuring similarity and verifying faithfulness into a solved algebraic problem rather than one of empirical approximation.
📊 文章统计
Article Statistics
基础数据
Basic Stats
183
浏览
Views
0
下载
Downloads
0
引用
Citations
引用趋势
Citation Trend
阅读国家分布
Country Distribution
阅读机构分布
Institution Distribution
月度浏览趋势
Monthly Views