🎙 [**Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model**](https://huggingface.co/papers/2309.11000)

[Xinyu Zhou (周欣宇)](https://www.linkedin.com/in/xinyu-zhou2000/), [Delong Chen (陈德龙)](https://chendelong.world/), [Yudong Chen (陈玉东)](https://rwxy.cuc.edu.cn/2019/0730/c5134a133504/pagem.htm)

[ArXiv](https://arxiv.org/abs/2309.11000) | [Poster](doc/YFRSW_Poster.pdf) | [Notebook](prosody_prediction.ipynb) | [Github](https://github.com/XinyuZhou2000/Spoken-Dialogue)
This project explores the potential of building an AI spoken dialogue system that *"thinks how to respond"* and *"thinks how to speak"* simultaneously, which aligns more closely with the human speech production process than the current cascade pipeline of independent chatbot and text-to-speech (TTS) modules. We hypothesize that *Large Language Models (LLMs)* with billions of parameters possess significant speech understanding capabilities and can jointly model dialogue responses and linguistic features. To demonstrate the speech understanding ability of LLMs, we investigate prosodic structure prediction (PSP), a typical TTS front-end task.
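Below is a minimal, hypothetical sketch (not the implementation from this repo) of what "joint modeling" could look like in practice: a single LLM call that returns a dialogue response already annotated with prosodic boundary labels. The prompt wording, the `joint_response_and_prosody` function, and the `llm` callable are illustrative assumptions; the label set follows the common Chinese TTS convention (#1 prosodic word, #2 prosodic phrase, #3 intonational phrase).

```python
# Hypothetical sketch: joint dialogue response + prosodic structure in one LLM call.
# Labels: #1 = prosodic word boundary, #2 = prosodic phrase, #3 = intonational phrase.

PROMPT_TEMPLATE = """You are a spoken dialogue assistant.
Reply to the user, and annotate your reply with prosodic boundary labels
(#1 = prosodic word, #2 = prosodic phrase, #3 = intonational phrase).

User: {user_utterance}
Annotated reply:"""


def joint_response_and_prosody(user_utterance: str, llm) -> str:
    """Query an LLM once to obtain a response with inline prosodic labels.

    `llm` is any callable mapping a prompt string to a completion string,
    e.g. a thin wrapper around an API client or a local model.
    """
    prompt = PROMPT_TEMPLATE.format(user_utterance=user_utterance)
    return llm(prompt)


if __name__ == "__main__":
    # Stub LLM for illustration only; replace with a real model call.
    fake_llm = lambda prompt: "今天#1天气#2很好#3适合#1出门#2散步#3"
    print(joint_response_and_prosody("今天天气怎么样?", fake_llm))
```

In contrast, a cascade pipeline would first generate the response text and then run a separate PSP model over it; the sketch above collapses both steps into one generation, which is the direction this project explores.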