MOSS-TTSD: Text to Spoken Dialogue Generation
Authors
Yuqian Zhang, Donghua Yu, Zhengyuan Lin, Botian Jiang, Mingshu Chen, Yaozhou Jiang, Yiwei Zhao, Yiyang Zhang, Yucheng Yuan, Hanfu Chen, Kexin Huang, Jun Zhan, Cheng Chang, Zhaoye Fei, Shimin Li, Xiaogui Yang, Qinyuan Cheng, Xipeng Qiu
Journal
No journal information available
Year
2026
Category
Country
Germany
Abstract
Spoken dialogue generation is crucial for applications like podcasts, dynamic commentary, and entertainment content, but poses significant challenges compared to single-utterance text-to-speech (TTS). Key requirements include accurate turn-taking, cross-turn acoustic consistency, and long-form stability, which current models often fail to address due to a lack of dialogue context modeling. To bridge this gap, we present MOSS-TTSD, a spoken dialogue synthesis model designed for expressive, multi-party conversational speech across multiple languages. With enhanced long-context modeling, MOSS-TTSD generates long-form spoken conversations from dialogue scripts with explicit speaker tags, supporting up to 60 minutes of single-pass synthesis, multi-party dialogue with up to 5 speakers, and zero-shot voice cloning from a short reference audio clip. The model supports various mainstream languages, including English and Chinese, and is adapted to several long-form scenarios. Additionally, to address limitations of existing evaluation methods, we propose TTSD-eval, an objective evaluation framework based on forced alignment that measures speaker attribution accuracy and speaker similarity without relying on speaker diarization tools. Both objective and subjective evaluation results show that MOSS-TTSD surpasses strong open-source and proprietary baselines in dialogue synthesis.
Article Statistics
Basic Stats
Views: 65
Downloads: 0
Citations: 25