ChatTTS is an advanced text-to-speech (TTS) model specifically designed for dialogue scenarios. Developed by the team at 2Noise, this model aims to deliver natural and expressive speech synthesis, making it ideal for applications such as virtual assistants, interactive voice response systems, and more. This article explains what ChatTTS is and how it works, and provides a comprehensive guide to installing and using it.
Before diving in, don't miss out on Anakin AI!
Anakin AI is an all-in-one platform for all your workflow automation. Create powerful AI apps with an easy-to-use no-code app builder, powered by Llama 3, Claude, GPT-4, uncensored LLMs, Stable Diffusion, and more.
Build your dream AI app within minutes, not weeks, with Anakin AI!
What is ChatTTS?
ChatTTS is a generative speech model optimized for dialogue-based tasks. Unlike traditional TTS systems, which often sound robotic and fail to convey the subtle elements of human speech, ChatTTS excels at producing lifelike conversational experiences. It supports both English and Chinese and is trained on over 100,000 hours of data; the open-source version available on HuggingFace is trained on 40,000 hours of data.
Key Features of ChatTTS
- Conversational TTS: Optimized for dialogue-based tasks, enabling natural and expressive speech synthesis with support for multiple speakers.
- Fine-grained Control: The ability to predict and control fine-grained prosodic features like laughter, pauses, and interjections.
- Improved Prosody: Surpassing most open-source TTS models in terms of prosody, delivering a truly lifelike experience.
How ChatTTS Works
ChatTTS leverages advanced machine learning techniques to generate speech that mimics human conversation. The model is designed to handle the nuances of dialogue, including intonation, pauses, and emotional expressions. Here’s a breakdown of its core components:
Model Architecture of ChatTTS
ChatTTS uses a combination of autoregressive and non-autoregressive models to generate speech. The autoregressive component helps in maintaining the flow of conversation, while the non-autoregressive part ensures that the speech is generated quickly and efficiently.
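The split between the two components can be illustrated with a toy sketch. This is not ChatTTS's actual code, just the general idea: an autoregressive stage emits discrete tokens one at a time, each conditioned on what came before, while a non-autoregressive stage then maps all tokens to audio frames in a single parallel pass.

```python
# Toy illustration of an autoregressive + non-autoregressive split.
# NOT ChatTTS's real architecture -- the update rules are placeholders.

def autoregressive_tokens(seed, n_tokens):
    """Emit tokens one at a time, each conditioned on the previous one."""
    tokens = [seed]
    for _ in range(n_tokens - 1):
        # Stand-in for a learned next-token predictor.
        tokens.append((tokens[-1] * 31 + 7) % 100)
    return tokens

def non_autoregressive_decode(tokens):
    """Map every token to an audio 'frame' independently, in one pass."""
    return [t / 100.0 for t in tokens]  # stand-in for a vocoder

tokens = autoregressive_tokens(seed=1, n_tokens=5)  # sequential, order-dependent
frames = non_autoregressive_decode(tokens)          # parallelizable per token
```

The sequential stage captures conversational flow; the parallel stage keeps generation fast, since each frame depends only on its own token.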
Training Data of ChatTTS
The model is trained on a massive dataset comprising over 100,000 hours of English and Chinese speech. This extensive training allows ChatTTS to understand and replicate the subtleties of human dialogue.
Fine-grained Control
One of the standout features of ChatTTS is its ability to control fine-grained prosodic features. This means it can insert laughter, pauses, and other interjections at appropriate moments, making the generated speech sound more natural and engaging.
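ChatTTS expresses these controls as bracketed tokens embedded in the text, such as [laugh_0], [break_6], or [uv_break] (all used later in this guide). As an illustration, a small hypothetical helper (not part of the ChatTTS API) can pull those tokens out of a refined transcript for inspection:

```python
import re

# Hypothetical helper: extract ChatTTS-style control tokens such as
# [laugh_0], [break_6], or [uv_break] from a refined transcript.
TOKEN_RE = re.compile(r"\[([a-z_]+?)(?:_(\d+))?\]")

def extract_control_tokens(text):
    """Return (name, level) pairs; level is None for flag tokens like uv_break."""
    out = []
    for name, level in TOKEN_RE.findall(text):
        out.append((name, int(level) if level else None))
    return out

print(extract_control_tokens("so we found [uv_break] it [laugh_2] funny"))
# -> [('uv_break', None), ('laugh', 2)]
```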
How to Use ChatTTS: a Step-by-Step Guide
ChatTTS is a powerful text-to-speech library that allows you to generate high-quality audio from text input. It provides a simple and intuitive API for integrating speech synthesis into your Python projects. In this section, we will explore how to use ChatTTS with code examples and step-by-step instructions.
Install ChatTTS
To get started with ChatTTS, you need to install the required dependencies. Run the following command to install the necessary packages:
pip install omegaconf torch tqdm einops vector_quantize_pytorch transformers vocos IPython
Begin by importing the required modules in your Python script:
import torch
import ChatTTS
from IPython.display import Audio
Set the following configuration options for PyTorch:
torch._dynamo.config.cache_size_limit = 64
torch._dynamo.config.suppress_errors = True
torch.set_float32_matmul_precision('high')
Loading Models with ChatTTS
Create an instance of the ChatTTS.Chat class and load the pre-trained models:
chat = ChatTTS.Chat()
chat.load_models()
If the model weights have been updated, use the force_redownload=True parameter:
chat.load_models(force_redownload=True)
If you have downloaded the weights manually, specify the local path using the source and local_path parameters:
chat.load_models(source='local', local_path='YOUR LOCAL PATH')
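To avoid hard-coding one of the two loading modes, a small convenience wrapper (hypothetical, not part of the ChatTTS API) can fall back to downloading the weights whenever no local copy is found:

```python
import os

# Hypothetical helper: choose local weights when they exist, otherwise
# fall back to the default HuggingFace download.
def load_models_args(local_path):
    """Return kwargs for chat.load_models() based on whether weights exist locally."""
    if local_path and os.path.isdir(local_path):
        return {"source": "local", "local_path": local_path}
    return {}  # default behavior: download from HuggingFace

# Usage: chat.load_models(**load_models_args("/path/to/local/models"))
```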
Running Inference with ChatTTS
Batch Inference with ChatTTS
You can perform batch inference by providing a list of texts to the infer method:
texts = ["So we found being competitive and collaborative was a huge way of staying motivated towards our goals, so one person to call when you fall off, one person who gets you back on then one person to actually do the activity with.",]*3 \
+ ["You know, as coders, I think we've all got a bit of a soft spot for open source, right? I mean, it's just such a cool concept. But here's the thing - all the really cutting-edge, mind-blowing tech? It's being hoarded by these big shot companies, and they sure as hell aren't gonna be sharing it with the rest of us anytime soon."]*3
wavs = chat.infer(texts)
You can then play the generated audio using the Audio function from IPython:
Audio(wavs[0], rate=24_000, autoplay=True)
Audio(wavs[3], rate=24_000, autoplay=True)
Using Custom Parameters with ChatTTS
You can customize the inference parameters by specifying params_infer_code and params_refine_text:
params_infer_code = {'prompt':'[speed_5]', 'temperature':.3}
params_refine_text = {'prompt':'[oral_2][laugh_0][break_6]'}
wav = chat.infer('Some of the best restaurants in Singapore include the three-Michelin-starred Odette known for its exquisite French cuisine, Restaurant Labyrinth which serves innovative modern Singaporean dishes, and Cloudstreet, a two-Michelin-starred contemporary restaurant by Chef Rishi Naleendra.', \
params_refine_text=params_refine_text, params_infer_code=params_infer_code)
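These prompt strings pack settings into bracketed tokens: [speed_5] adjusts speaking speed, while [oral_2][laugh_0][break_6] tune oral style, laughter, and pauses during text refinement. A tiny hypothetical helper can assemble such a string from a dict of levels:

```python
# Hypothetical helper: build a ChatTTS prompt string such as
# '[oral_2][laugh_0][break_6]' from a mapping of control name -> level.
def build_prompt(levels):
    return "".join(f"[{name}_{value}]" for name, value in levels.items())

print(build_prompt({"oral": 2, "laugh": 0, "break": 6}))
# -> [oral_2][laugh_0][break_6]
```

Usage: params_refine_text = {'prompt': build_prompt({"oral": 2, "laugh": 0, "break": 6})}.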
Using Random Speaker with ChatTTS
You can generate audio with a random speaker by sampling a random speaker embedding:
rand_spk = chat.sample_random_speaker()
params_infer_code = {'spk_emb' : rand_spk, }
wav = chat.infer('Some of the most iconic and beloved dishes in Singapore include chicken rice, with Tian Tian being one of the most famous, chili crab, with restaurants like Jumbo Seafood being very popular, and Peranakan cuisine, with the one-Michelin-starred Candlenut restaurant being a top spot to try authentic Peranakan flavors.', \
params_refine_text=params_refine_text, params_infer_code=params_infer_code)
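Because sample_random_speaker() draws a fresh voice each time, you may want to persist an embedding you like so the same voice can be reproduced in a later session. A sketch using pickle, with a placeholder list standing in for the real embedding object:

```python
import pickle

# Placeholder standing in for the value returned by chat.sample_random_speaker().
rand_spk = [0.12, -0.48, 0.91]

# Persist the embedding so the same voice can be reused across runs.
with open("speaker.pkl", "wb") as f:
    pickle.dump(rand_spk, f)

with open("speaker.pkl", "rb") as f:
    restored = pickle.load(f)

# Usage: params_infer_code = {'spk_emb': restored}
```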
Implement Two-Stage Control with ChatTTS
ChatTTS allows you to control the text refinement and audio generation separately using the refine_text_only and skip_refine_text parameters:
text = "So we found being competitive and collaborative was a huge way of staying motivated towards our goals, so one person to call when you fall off, one person who gets you back on then one person to actually do the activity with."
chat.infer(text, refine_text_only=True)
text = 'so we found being competitive and collaborative [uv_break] was a huge way of staying [uv_break] motivated towards our goals, [uv_break] so [uv_break] one person to call [uv_break] when you fall off, [uv_break] one person who [uv_break] gets you back [uv_break] on then [uv_break] one person [uv_break] to actually do the activity with.'
wav = chat.infer(text, skip_refine_text=True)
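Note that the refined string above is simply the original text with [uv_break] tokens inserted at natural pause points. As a rough illustration of the idea (the real refinement model is learned, not rule-based), a naive version might break after punctuation:

```python
import re

def naive_refine(text):
    """Crude stand-in for ChatTTS text refinement: insert [uv_break]
    after commas and sentence-ending punctuation."""
    return re.sub(r"([,.;!?])\s+", r"\1 [uv_break] ", text)

print(naive_refine("one person to call when you fall off, one person who gets you back on."))
```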
LLM Integration with ChatTTS
ChatTTS can be integrated with large language models (LLMs) to generate text from user questions. Here's an example using the DeepSeek API:
from ChatTTS.experimental.llm import llm_api
API_KEY = ''
client = llm_api(api_key=API_KEY,
base_url="https://api.deepseek.com",
model="deepseek-chat")
user_question = 'What are the best restaurants in Singapore?'
text = client.call(user_question, prompt_version = 'deepseek')
print(text)
text = client.call(text, prompt_version = 'deepseek_TN')
print(text)
You can then generate audio using the generated text:
params_infer_code = {'spk_emb' : rand_spk, 'temperature':.3}
wav = chat.infer(text, params_infer_code=params_infer_code)
Using the ChatTTS Web UI
ChatTTS also provides a web-based user interface for generating audio. You can launch the web UI using the webui.py script:
python webui.py --server_name 0.0.0.0 --server_port 8080 --local_path /path/to/local/models
The web UI allows you to input text, adjust parameters, and generate audio interactively.
That's it! You now have a comprehensive guide on how to use ChatTTS in Python. With these examples and steps, you can integrate ChatTTS into your projects and generate high-quality speech from text input. Feel free to explore more advanced features and experiment with different parameters to customize the generated audio to your needs.
Conclusion
ChatTTS is a groundbreaking text-to-speech model that brings a new level of realism to conversational AI. With its ability to handle multiple languages, fine-grained control over prosodic features, and support for multiple speakers, it stands out as a powerful tool for developers and researchers alike. By following the installation and usage guidelines provided, you can start leveraging ChatTTS for your own projects and contribute to the ongoing development of this exciting technology.