ChatTTS is an advanced text-to-speech (TTS) model specifically designed for dialogue scenarios. Developed by the team at 2Noise, this model aims to deliver natural and expressive speech synthesis, making it ideal for applications such as virtual assistants, interactive voice response systems, and more. This article explains what ChatTTS is and how it works, and provides a comprehensive guide to installing and using it.
Before diving in, don't miss out on Anakin AI!
Anakin AI is an all-in-one platform for all your workflow automation. Create powerful AI apps with an easy-to-use no-code app builder, powered by Llama 3, Claude, GPT-4, uncensored LLMs, Stable Diffusion, and more.
Build your dream AI app within minutes, not weeks, with Anakin AI!
What is ChatTTS?
ChatTTS is a generative speech model optimized for dialogue-based tasks. Unlike traditional TTS systems, which often sound robotic and fail to convey the subtle elements of human speech, ChatTTS excels at producing lifelike conversational experiences. It supports both English and Chinese and is trained on over 100,000 hours of data; the open-source version available on HuggingFace is trained on 40,000 hours of data.
Key Features of ChatTTS
- Conversational TTS: Optimized for dialogue-based tasks, enabling natural and expressive speech synthesis with support for multiple speakers.
- Fine-grained Control: The ability to predict and control fine-grained prosodic features like laughter, pauses, and interjections.
- Improved Prosody: Surpassing most open-source TTS models in terms of prosody, delivering a truly lifelike experience.
How ChatTTS Works
ChatTTS leverages advanced machine learning techniques to generate speech that mimics human conversation. The model is designed to handle the nuances of dialogue, including intonation, pauses, and emotional expressions. Here’s a breakdown of its core components:
Model Architecture of ChatTTS
ChatTTS uses a combination of autoregressive and non-autoregressive models to generate speech. The autoregressive component helps in maintaining the flow of conversation, while the non-autoregressive part ensures that the speech is generated quickly and efficiently.
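The split between the two components can be illustrated with a toy sketch. This is not ChatTTS's actual code, just the general idea: an autoregressive stage emits discrete tokens one at a time, each conditioned on what came before, while a non-autoregressive stage then maps all tokens to audio frames in a single parallel pass.

```python
# Toy illustration of an autoregressive + non-autoregressive split.
# NOT ChatTTS's real architecture -- the update rules are placeholders.

def autoregressive_tokens(seed, n_tokens):
    """Emit tokens one at a time, each conditioned on the previous one."""
    tokens = [seed]
    for _ in range(n_tokens - 1):
        # Stand-in for a learned next-token predictor.
        tokens.append((tokens[-1] * 31 + 7) % 100)
    return tokens

def non_autoregressive_decode(tokens):
    """Map every token to an audio 'frame' independently, in one pass."""
    return [t / 100.0 for t in tokens]  # stand-in for a vocoder

tokens = autoregressive_tokens(seed=1, n_tokens=5)  # sequential, order-dependent
frames = non_autoregressive_decode(tokens)          # parallelizable per token
```

The sequential stage captures conversational flow; the parallel stage keeps generation fast, since each frame depends only on its own token.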
Training Data of ChatTTS
The model is trained on a massive dataset comprising over 100,000 hours of English and Chinese speech. This extensive training allows ChatTTS to understand and replicate the subtleties of human dialogue.
Fine-grained Control
One of the standout features of ChatTTS is its ability to control fine-grained prosodic features. This means it can insert laughter, pauses, and other interjections at appropriate moments, making the generated speech sound more natural and engaging.
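ChatTTS expresses these controls as bracketed tokens embedded in the text, such as [laugh_0], [break_6], or [uv_break] (all used later in this guide). As an illustration, a small hypothetical helper (not part of the ChatTTS API) can pull those tokens out of a refined transcript for inspection:

```python
import re

# Hypothetical helper: extract ChatTTS-style control tokens such as
# [laugh_0], [break_6], or [uv_break] from a refined transcript.
TOKEN_RE = re.compile(r"\[([a-z_]+?)(?:_(\d+))?\]")

def extract_control_tokens(text):
    """Return (name, level) pairs; level is None for flag tokens like uv_break."""
    out = []
    for name, level in TOKEN_RE.findall(text):
        out.append((name, int(level) if level else None))
    return out

print(extract_control_tokens("so we found [uv_break] it [laugh_2] funny"))
# -> [('uv_break', None), ('laugh', 2)]
```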
How to Use ChatTTS: a Step-by-Step Guide
ChatTTS is a powerful text-to-speech library that allows you to generate high-quality audio from text input. It provides a simple and intuitive API for integrating speech synthesis into your Python projects. In this section, we will explore how to use ChatTTS with code examples and step-by-step instructions.
Install ChatTTS
To get started with ChatTTS, you need to install the required dependencies. Run the following command to install the necessary packages:
pip install omegaconf torch tqdm einops vector_quantize_pytorch transformers vocos IPython
Begin by importing the required modules in your Python script:
import torch
import ChatTTS
from IPython.display import Audio
Set the following configuration options for PyTorch:
torch._dynamo.config.cache_size_limit = 64
torch._dynamo.config.suppress_errors = True
torch.set_float32_matmul_precision('high')
Loading Models with ChatTTS
Create an instance of the ChatTTS.Chat class and load the pre-trained models:
chat = ChatTTS.Chat()
chat.load_models()
If the model weights have been updated, use the force_redownload=True parameter:
chat.load_models(force_redownload=True)
If you have downloaded the weights manually, specify the local path using the source and local_path parameters:
chat.load_models(source='local', local_path='YOUR LOCAL PATH')
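To avoid hard-coding one of the two loading modes, a small convenience wrapper (hypothetical, not part of the ChatTTS API) can fall back to downloading the weights whenever no local copy is found:

```python
import os

# Hypothetical helper: choose local weights when they exist, otherwise
# fall back to the default HuggingFace download.
def load_models_args(local_path):
    """Return kwargs for chat.load_models() based on whether weights exist locally."""
    if local_path and os.path.isdir(local_path):
        return {"source": "local", "local_path": local_path}
    return {}  # default behavior: download from HuggingFace

# Usage: chat.load_models(**load_models_args("/path/to/local/models"))
```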
Running Inference with ChatTTS
Batch Inference with ChatTTS
You can perform batch inference by providing a list of texts to the infer method:
texts = ["So we found being competitive and collaborative was a huge way of staying motivated towards our goals, so one person to call when you fall off, one person who gets you back on then one person to actually do the activity with.",]*3 \
+ ["You know, as coders, I think we've all got a bit of a soft spot for open source, right? I mean, it's just such a cool concept. But here's the thing - all the really cutting-edge, mind-blowing tech? It's being hoarded by these big shot companies, and they sure as hell aren't gonna be sharing it with the rest of us anytime soon."]*3
wavs = chat.infer(texts)
You can then play the generated audio using the Audio function from IPython:
Audio(wavs[0], rate=24_000, autoplay=True)
Audio(wavs[3], rate=24_000, autoplay=True)
Using Custom Parameters with ChatTTS
You can customize the inference parameters by specifying params_infer_code and params_refine_text:
params_infer_code = {'prompt':'[speed_5]', 'temperature':.3}
params_refine_text = {'prompt':'[oral_2][laugh_0][break_6]'}
wav = chat.infer('Some of the best restaurants in Singapore include the three-Michelin-starred Odette known for its exquisite French cuisine, Restaurant Labyrinth which serves innovative modern Singaporean dishes, and Cloudstreet, a two-Michelin-starred contemporary restaurant by Chef Rishi Naleendra.', \
params_refine_text=params_refine_text, params_infer_code=params_infer_code)
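These prompt strings pack settings into bracketed tokens: [speed_5] adjusts speaking speed, while [oral_2][laugh_0][break_6] tune oral style, laughter, and pauses during text refinement. A tiny hypothetical helper can assemble such a string from a dict of levels:

```python
# Hypothetical helper: build a ChatTTS prompt string such as
# '[oral_2][laugh_0][break_6]' from a mapping of control name -> level.
def build_prompt(levels):
    return "".join(f"[{name}_{value}]" for name, value in levels.items())

print(build_prompt({"oral": 2, "laugh": 0, "break": 6}))
# -> [oral_2][laugh_0][break_6]
```

Usage: params_refine_text = {'prompt': build_prompt({"oral": 2, "laugh": 0, "break": 6})}.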
Using Random Speaker with ChatTTS
You can generate audio with a random speaker by sampling a random speaker embedding:
rand_spk = chat.sample_random_speaker()
params_infer_code = {'spk_emb' : rand_spk, }
wav = chat.infer('Some of the most iconic and beloved dishes in Singapore include chicken rice, with Tian Tian being one of the most famous, chili crab, with restaurants like Jumbo Seafood being very popular, and Peranakan cuisine, with the one-Michelin-starred Candlenut restaurant being a top spot to try authentic Peranakan flavors.', \
params_refine_text=params_refine_text, params_infer_code=params_infer_code)
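Because sample_random_speaker() draws a fresh voice each time, you may want to persist an embedding you like so the same voice can be reproduced in a later session. A sketch using pickle, with a placeholder list standing in for the real embedding object:

```python
import pickle

# Placeholder standing in for the value returned by chat.sample_random_speaker().
rand_spk = [0.12, -0.48, 0.91]

# Persist the embedding so the same voice can be reused across runs.
with open("speaker.pkl", "wb") as f:
    pickle.dump(rand_spk, f)

with open("speaker.pkl", "rb") as f:
    restored = pickle.load(f)

# Usage: params_infer_code = {'spk_emb': restored}
```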
Implement Two-Stage Control with ChatTTS
ChatTTS allows you to control the text refinement and audio generation separately using the refine_text_only and skip_refine_text parameters:
text = "So we found being competitive and collaborative was a huge way of staying motivated towards our goals, so one person to call when you fall off, one person who gets you back on then one person to actually do the activity with."
chat.infer(text, refine_text_only=True)
text = 'so we found being competitive and collaborative [uv_break] was a huge way of staying [uv_break] motivated towards our goals, [uv_break] so [uv_break] one person to call [uv_break] when you fall off, [uv_break] one person who [uv_break] gets you back [uv_break] on then [uv_break] one person [uv_break] to actually do the activity with.'
wav = chat.infer(text, skip_refine_text=True)
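Note that the refined string above is simply the original text with [uv_break] tokens inserted at natural pause points. As a rough illustration of the idea (the real refinement model is learned, not rule-based), a naive version might break after punctuation:

```python
import re

def naive_refine(text):
    """Crude stand-in for ChatTTS text refinement: insert [uv_break]
    after commas and sentence-ending punctuation."""
    return re.sub(r"([,.;!?])\s+", r"\1 [uv_break] ", text)

print(naive_refine("one person to call when you fall off, one person who gets you back on."))
```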
LLM Integration with ChatTTS
ChatTTS can be integrated with large language models (LLMs) to generate text from user questions. Here's an example using the DeepSeek API:
from ChatTTS.experimental.llm import llm_api
API_KEY = ''
client = llm_api(api_key=API_KEY,
base_url="https://api.deepseek.com",
model="deepseek-chat")
user_question = 'What are the best restaurants in Singapore?'
text = client.call(user_question, prompt_version = 'deepseek')
print(text)
text = client.call(text, prompt_version = 'deepseek_TN')
print(text)
You can then generate audio using the generated text:
params_infer_code = {'spk_emb' : rand_spk, 'temperature':.3}
wav = chat.infer(text, params_infer_code=params_infer_code)
Using the ChatTTS Web UI
ChatTTS also provides a web-based user interface for generating audio. You can launch the web UI using the webui.py script:
python webui.py --server_name 0.0.0.0 --server_port 8080 --local_path /path/to/local/models
The web UI allows you to input text, adjust parameters, and generate audio interactively.
That's it! You now have a comprehensive guide on how to use ChatTTS in Python. With these examples and steps, you can integrate ChatTTS into your projects and generate high-quality speech from text input. Feel free to explore more advanced features and experiment with different parameters to customize the generated audio to your needs.
Conclusion
ChatTTS is a groundbreaking text-to-speech model that brings a new level of realism to conversational AI. With its ability to handle multiple languages, fine-grained control over prosodic features, and support for multiple speakers, it stands out as a powerful tool for developers and researchers alike. By following the installation and usage guidelines provided, you can start leveraging ChatTTS for your own projects and contribute to the ongoing development of this exciting technology.