• https://github.com/bytedance/MegaTTS3

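MegaTTS3 is an open-source TTS model from ByteDance that supports multiple languages, including Chinese, English, Japanese, and Korean. It has 450 million parameters.

When using it, the uploaded reference audio sample must be shorter than 24 s, and the file name must not contain spaces.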
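The two constraints above can be checked before uploading a sample. The following is a minimal sketch, assuming the sample is an uncompressed PCM WAV file readable by Python's standard `wave` module and that the "no spaces" rule refers to the file name. The helper name `check_prompt_wav` and the constant `MAX_SECONDS` are illustrative and not part of MegaTTS3.

```python
import os
import wave

MAX_SECONDS = 24.0  # limit stated in the text above


def check_prompt_wav(path, max_seconds=MAX_SECONDS):
    """Return a list of problems found with a reference WAV file (empty if none)."""
    problems = []
    # Assumption: the "no spaces" rule applies to the file name.
    if " " in os.path.basename(path):
        problems.append("file name contains spaces")
    # wave.open reads PCM WAV files only; other formats raise wave.Error.
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    if duration >= max_seconds:
        problems.append(f"duration {duration:.1f}s is not under {max_seconds:.0f}s")
    return problems
```

For example, `check_prompt_wav("assets/English_prompt.wav")` returns an empty list when the file passes both checks.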

1. Installation

Clone the repository from GitHub:

# Clone the repository
git clone https://github.com/bytedance/MegaTTS3
cd MegaTTS3

Install on Unix-like systems:


# Create a python 3.10 conda env (you could also use virtualenv)
conda create -n megatts3-env python=3.10
conda activate megatts3-env
pip install -r requirements.txt

# Set the root directory
export PYTHONPATH="/path/to/MegaTTS3:$PYTHONPATH"

# [Optional] Set GPU
export CUDA_VISIBLE_DEVICES=0

# If you encounter pydantic errors during inference, check that the installed pydantic and gradio versions are compatible.
# [Note] If you encounter httpx-related errors, check whether your "no_proxy" environment variable contains patterns like "::".
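The two troubleshooting notes above can be turned into a quick diagnostic. This is a rough sketch, not an official tool: the function names are hypothetical, and it only reports what is installed and which `no_proxy` entries contain `::`. It does not decide which pydantic and gradio versions are compatible.

```python
import os
from importlib import metadata


def installed_versions(packages=("pydantic", "gradio", "httpx")):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def no_proxy_entries_with_double_colon(value):
    """Return the comma-separated no_proxy entries that contain '::'."""
    entries = [entry.strip() for entry in value.split(",")]
    return [entry for entry in entries if "::" in entry]


if __name__ == "__main__":
    print(installed_versions())
    # The variable may be spelled in lower or upper case depending on the system.
    raw = os.environ.get("no_proxy") or os.environ.get("NO_PROXY") or ""
    print(no_proxy_entries_with_double_colon(raw))
```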

Run with Docker:

# [The Docker version is currently under testing]
# ! Download the pretrained checkpoints before running the following command
docker build . -t megatts3:latest

# For GPU inference
docker run -it -p 7929:7929 --gpus all -e CUDA_VISIBLE_DEVICES=0 megatts3:latest
# For CPU inference
docker run -it -p 7929:7929  megatts3:latest

# Visit http://0.0.0.0:7929/ for gradio.
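Once the container is running, it can help to confirm that the Gradio port is reachable before opening a browser. The sketch below assumes port 7929 is published on the local machine as in the commands above. It connects to 127.0.0.1 rather than 0.0.0.0, since 0.0.0.0 is a bind address and is not accepted as a destination on every system. The function name `is_port_open` is illustrative.

```python
import socket


def is_port_open(host="127.0.0.1", port=7929, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("Gradio port reachable:", is_port_open())
```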

2. Inference

Standard

# p_w (intelligibility weight), t_w (similarity weight). Typically, a noisier prompt requires higher p_w and t_w.
python tts/infer_cli.py --input_wav 'assets/Chinese_prompt.wav'  --input_text "另一边的桌上,一位读书人嗤之以鼻道,'佛子三藏,神子燕小鱼是什么样的人物,李家的那个李子夜如何与他们相提并论?'" --output_dir ./gen

# As long as audio volume and pronunciation are appropriate, increasing --t_w within reasonable ranges (2.0~5.0)
# will increase the generated speech's expressiveness and similarity (especially for some emotional cases).
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text 'As his long promised tariff threat turned into reality this week, top human advisers began fielding a wave of calls from business leaders, particularly in the automotive sector, along with lawmakers who were sounding the alarm.' --output_dir ./gen --p_w 2.0 --t_w 3.0
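When the CLI is called from a script, building the argument list in Python avoids shell quoting problems with long or quoted input text. The following is a minimal sketch that uses only the flags shown in the commands above. The helper names `build_infer_command` and `run_inference` are hypothetical, and the script is assumed to be run from the repository root.

```python
import subprocess
import sys


def build_infer_command(input_wav, input_text, output_dir="./gen", p_w=None, t_w=None):
    """Build the argument list for tts/infer_cli.py using the flags shown above."""
    cmd = [sys.executable, "tts/infer_cli.py",
           "--input_wav", input_wav,
           "--input_text", input_text,
           "--output_dir", output_dir]
    # The weights are optional; when omitted, the CLI's own defaults apply.
    if p_w is not None:
        cmd += ["--p_w", str(p_w)]
    if t_w is not None:
        cmd += ["--t_w", str(t_w)]
    return cmd


def run_inference(**kwargs):
    """Run the CLI from the repository root; raises CalledProcessError on failure."""
    subprocess.run(build_infer_command(**kwargs), check=True)
```

Passing the arguments as a list means the text is handed to the script unchanged, so quotation marks inside the input text need no escaping.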

Accented

  • https://drive.google.com/drive/folders/1gCWL1y_2xu9nIFhUX_OW5MbcFuB7J5Cl Some existing accented examples can be downloaded from this Google Drive folder.

```shell
# When p_w (intelligibility weight) ≈ 1.0, the generated audio closely retains the speaker's original accent. As p_w increases, it shifts toward standard pronunciation.
# t_w (similarity weight) is typically set 0–3 points higher than p_w for optimal results.
# Useful for accented TTS or solving the accent problems in cross-lingual TTS.
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '这是一条有口音的音频。' --output_dir ./gen --p_w 1.0 --t_w 3.0

python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '这条音频的发音标准一些了吗?' --output_dir ./gen --p_w 2.5 --t_w 2.5
```
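To hear how the accent changes as p_w grows, one option is to generate a small sweep of settings. The sketch below pairs each p_w with a t_w that is a fixed offset higher, following the 0 to 3 guideline above, and prints the corresponding commands. The function names, the default values, and the per-setting output directories are assumptions made for illustration. The separate directories are a precaution in case output file names collide, since the naming scheme is not described here.

```python
import shlex


def accent_sweep(p_w_values=(1.0, 1.5, 2.0, 2.5), t_w_offset=1.0):
    """Pair each p_w with a t_w that is t_w_offset higher."""
    if not 0.0 <= t_w_offset <= 3.0:
        raise ValueError("t_w_offset is expected to be between 0 and 3")
    return [(p_w, p_w + t_w_offset) for p_w in p_w_values]


def format_commands(pairs, input_wav, input_text, output_root="./gen"):
    """Return one shell command string per (p_w, t_w) pair."""
    commands = []
    for p_w, t_w in pairs:
        # Hypothetical layout: one output directory per setting.
        output_dir = f"{output_root}/p{p_w}_t{t_w}"
        commands.append(
            "python tts/infer_cli.py"
            f" --input_wav {shlex.quote(input_wav)}"
            f" --input_text {shlex.quote(input_text)}"
            f" --output_dir {shlex.quote(output_dir)}"
            f" --p_w {p_w} --t_w {t_w}"
        )
    return commands


if __name__ == "__main__":
    for line in format_commands(accent_sweep(), "assets/English_prompt.wav", "这是一条有口音的音频。"):
        print(line)
```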


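WebUI

```shell
# CPU inference is also supported, but it may take about 30 seconds (for 10 inference steps).
python tts/gradio_api.py
```

Gradio is an open-source Python library for quickly building interactive interfaces for machine learning and data science models. It lets developers wrap a model in an intuitive web interface through a simple API, so a model can be demonstrated and tested interactively without any front-end development experience.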

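import gradio as gr


def greet(name):
    return "Hello " + name + "!"


def synthesize(inp_audio, inp_npy, inp_text, infer_timestep, p_w, t_w):
    # Placeholder callback, not MegaTTS3's real inference function: it returns
    # the reference clip unchanged so the interface can run without the model.
    return inp_audio


# Minimal example: a text-in, text-out interface
demo = gr.Interface(
    fn=greet,  # callback function
    inputs="text",  # input type
    outputs="text",  # output type
    title="Hello World Demo"  # interface title
)

if __name__ == '__main__':
    # Interface with the same inputs and outputs as the MegaTTS3 WebUI.
    # The callback must accept one argument per input component.
    api_interface = gr.Interface(fn=synthesize,
                                 inputs=[gr.Audio(type="filepath", label="Upload .wav"),
                                         gr.File(type="filepath", label="Upload .npy"), "text",
                                         gr.Number(label="infer timestep", value=32),
                                         gr.Number(label="Intelligibility Weight", value=1.4),
                                         gr.Number(label="Similarity Weight", value=3.0)],
                                 outputs=[gr.Audio(label="Synthesized Audio")],
                                 title="MegaTTS3",
                                 description="Upload a speech clip as a reference for timbre, " +
                                             "upload the pre-extracted latent file, " +
                                             "input the target text, and receive the cloned voice.",
                                 concurrency_limit=1)
    # Launch the interface. This call blocks, so only one interface is launched;
    # replace it with demo.launch() to try the minimal example instead.
    api_interface.launch(server_name='0.0.0.0', server_port=7929, debug=True)

The code above creates the interface shown below. The function passed as fn (greet in the minimal example, synthesize in the MegaTTS3-style interface) is the callback that Gradio calls on each submission.

gradio_api.png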



Page last modified: Jun 17 2025 at 12:00 AM.