VibeVoice is a text-to-speech (TTS) model that generates natural conversational dialogue from text. It supports up to 4 speakers and recordings of up to 90 minutes.
Key Features:
- Two models: small and large
- Up to 4 speakers in a single recording
- Up to 90 minutes of generated audio
- Language support: English (default) and Chinese are officially supported; the model has also been found to work reasonably well for other languages.
How to use the model
Write the text in English or Chinese; quality is not guaranteed for other languages. The maximum text length is 5000 characters. Avoid special characters. The text must follow a specific format that indicates which speaker says each line:
Correct format:
Speaker 1: Hello! How are you today?
Speaker 2: I'm doing great, thanks for asking!
Speaker 1: That's wonderful to hear.
Speaker 3: Hey everyone, sorry I'm late!
Incorrect format:
Hello! How are you today?
I'm doing great!
Important:
- Each line must start with Speaker N: (where N is a number from 1 to 4)
- Speaker numbering: Speaker 1, Speaker 2, Speaker 3, Speaker 4
- You can use from 1 to 4 speakers
- Case does not matter: Speaker 1: = speaker 1: = SPEAKER 1:
For a monologue, the speaker label is optional.
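The formatting rules above can be checked before text is sent to the model. The following is a minimal Python sketch, not part of VibeVoice itself: the function name validate_script, the monologue flag, and the exact matching rule (one space between the word and the number, colon directly after the number) are assumptions made for illustration.

```python
import re

MAX_CHARS = 5000  # maximum text length stated in this document

# Assumed matching rule: "Speaker N:" with N from 1 to 4, any letter case.
SPEAKER_LINE = re.compile(r"^speaker ([1-4]):", re.IGNORECASE)


def validate_script(text, monologue=False):
    """Return a list of problems; an empty list means the text passed.

    monologue=True skips the label check, since a monologue
    does not need a speaker label.
    """
    problems = []
    if len(text) > MAX_CHARS:
        problems.append(
            f"text is {len(text)} characters; the limit is {MAX_CHARS}"
        )
    if monologue:
        return problems
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are ignored in this sketch
        if not SPEAKER_LINE.match(stripped):
            problems.append(
                f"line {number} does not start with 'Speaker N:' (N from 1 to 4)"
            )
    return problems
```

Calling validate_script on the correct-format example returns an empty list, while the incorrect-format example yields one problem per unlabeled line.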
Example scenarios:
Monologue (1 speaker):
Speaker 1: Today I want to talk about artificial intelligence.
Speaker 1: It's changing our world in incredible ways.
Speaker 1: From healthcare to entertainment, AI is everywhere.
Dialogue (2 speakers):
Speaker 1: Have you tried the new restaurant downtown?
Speaker 2: Not yet, but I've heard great things about it!
Speaker 1: We should go there this weekend.
Speaker 2: That sounds like a perfect plan!
Group conversation (3-4 speakers):
Speaker 1: Welcome to our podcast, everyone!
Speaker 2: Thanks for having us!
Speaker 3: It's great to be here.
Speaker 4: I'm excited to share our thoughts today.
Speaker 1: Let's start with introductions.
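Scripts like the ones above can also be assembled programmatically. This is a small Python sketch under stated assumptions: format_script is a hypothetical helper, not a VibeVoice function, and it takes a list of (speaker number, text) pairs and produces text in the Speaker N: format described in this document.

```python
MAX_CHARS = 5000  # maximum text length stated in this document


def format_script(turns):
    """Build a script from (speaker_number, text) pairs.

    Hypothetical helper: raises ValueError if a speaker number is
    outside 1 to 4 or the result exceeds the length limit.
    """
    lines = []
    for speaker, text in turns:
        if speaker not in (1, 2, 3, 4):
            raise ValueError(f"speaker must be 1 to 4, got {speaker!r}")
        # Collapse line breaks and repeated spaces so each turn stays on one line.
        one_line = " ".join(text.split())
        lines.append(f"Speaker {speaker}: {one_line}")
    script = "\n".join(lines)
    if len(script) > MAX_CHARS:
        raise ValueError(
            f"script is {len(script)} characters; the limit is {MAX_CHARS}"
        )
    return script


if __name__ == "__main__":
    print(format_script([
        (1, "Have you tried the new restaurant downtown?"),
        (2, "Not yet, but I've heard great things about it!"),
    ]))
```

Keeping each turn on a single line matches the rule that every line starts with a speaker label.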