统计 (Statistical) Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular 数据 (Data)sets
Statistical Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular Datasets

作者
Authors M. Ross Kunz | John Merickel | Keith Wilson

期刊
Journal 暂无期刊信息

年份
Year 2026

分类
Category 生物统计
Biostatistics

国家
Country -

🔗 访问原文
🔗 Access Paper

📝 摘要
Abstract

Numeric tabular datasets are the dominant data format in scientific practice, yet large language models lack native mechanisms for representing numeric datasets in a meaningful way across heterogeneous feature spaces. Existing approaches either target predictive modeling over individual datasets, which requires a shared set of variable definitions, or lack mechanisms for interpretable cross-dataset alignment. The proposed methodology characterizes numeric tabular datasets through structured exploratory data analysis descriptors, embeds those descriptors into a shared vector space using a pretrained sentence transformer, and quantifies cross-dataset similarity via Canonical Correlation Analysis (CCA). Furthermore, a penalized formulation of CCA is applied to recover sparse, interpretable variable-level correspondences between datasets, identifying which statistical descriptors or variable-level quantities drive cross-dataset alignment without requiring shared variable names or feature conventions. Differential privacy is optionally applied to the descriptor set prior to embedding, supporting deployment in sensitive data contexts without requiring access to raw observations at time of comparison. The methodology is evaluated across 15 datasets spanning general-purpose benchmarks, materials informatics, and nuclear-grade graphite characterization. Results demonstrate a total P@1 score of 0.9, with known nearest-neighbor retrieval and cluster structure remaining robust across embedding ablations and differential privacy budgets. The proposed framework provides a principled pathway for integrating heterogeneous numeric data into retrieval-augmented generation pipelines while preserving statistical context, with direct applications to data-driven algorithm selection and simulation model initialization for unknown datasets.

📊 文章统计
Article Statistics

基础数据
Basic Stats

124 浏览
Views

0 下载
Downloads

5 引用
Citations

引用趋势
Citation Trend

阅读国家分布
Country Distribution

阅读机构分布
Institution Distribution

月度浏览趋势
Monthly Views

影响因子分析
Impact Analysis

6.30 综合评分
Overall Score

引用影响力
Citation Impact

浏览热度
View Popularity

下载频次
Download Frequency

统计 (Statistical) Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular 数据 (Data)sets
Statistical Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular Datasets

📝 摘要
Abstract

📊 文章统计
Article Statistics

基础数据
Basic Stats

引用趋势
Citation Trend

阅读国家分布
Country Distribution

阅读机构分布
Institution Distribution

月度浏览趋势
Monthly Views

相关关键词
Related Keywords

影响因子分析
Impact Analysis

📄 相关文章
Related Articles

统计 (Statistical) Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular 数据 (Data)setsStatistical Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular Datasets

📝 摘要Abstract

📊 文章统计Article Statistics

基础数据Basic Stats

引用趋势Citation Trend

阅读国家分布Country Distribution

阅读机构分布Institution Distribution

月度浏览趋势Monthly Views

相关关键词Related Keywords

影响因子分析Impact Analysis

📄 相关文章Related Articles

海洋智能分析Ocean AI Analysis

统计 (Statistical) Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular 数据 (Data)sets
Statistical Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular Datasets

📝 摘要
Abstract

📊 文章统计
Article Statistics

基础数据
Basic Stats

引用趋势
Citation Trend

阅读国家分布
Country Distribution

阅读机构分布
Institution Distribution

月度浏览趋势
Monthly Views

相关关键词
Related Keywords

影响因子分析
Impact Analysis

📄 相关文章
Related Articles