VibeVoice (Voice Cloning)

VibeVoice is a model that generates natural conversational dialogue from text and can clone a voice from a reference recording.

Key features:

Two models: small and large
Up to 90 minutes of generated audio
Language support: English (default) and Chinese
Voice cloning: ability to upload a reference audio recording

How to use the model

Write the text in English or Chinese; quality is not guaranteed for other languages. The maximum text length is 5000 characters. Avoid special characters.
The reference voice recording should be 5 to 15 seconds long. A longer track is automatically trimmed at the 15-second mark.
The reference track should contain only the voice and nothing else. If it has background sounds or music, use the "Extract vocals first" option.
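
The limits above can be checked before uploading anything. The following is only a minimal sketch: the function names and constants are hypothetical and not part of VibeVoice. It assumes you already know the length of your recording in seconds, and it does not check for special characters, since the page does not define which ones to avoid.

```python
# Hypothetical pre-upload checks; the numbers come from the limits stated above.
MAX_TEXT_CHARS = 5000      # maximum text length
MIN_REF_SECONDS = 5.0      # shortest accepted reference recording
MAX_REF_SECONDS = 15.0     # longer recordings are trimmed at this point


def check_text(text: str) -> list[str]:
    """Return a list of problems found in the input text (empty if none)."""
    problems = []
    if not text.strip():
        problems.append("text is empty")
    if len(text) > MAX_TEXT_CHARS:
        problems.append(
            f"text has {len(text)} characters; the limit is {MAX_TEXT_CHARS}"
        )
    return problems


def usable_reference_seconds(duration: float) -> float:
    """Return how many seconds of the reference track would be used.

    Raises ValueError if the recording is shorter than the minimum.
    """
    if duration < MIN_REF_SECONDS:
        raise ValueError(
            f"reference is {duration:.1f} s; at least {MIN_REF_SECONDS:.0f} s is needed"
        )
    # Anything past the 15-second mark is cut off.
    return min(duration, MAX_REF_SECONDS)
```

For example, usable_reference_seconds(22.0) returns 15.0, so only the first 15 seconds of a 22-second recording would count. Put the part of the recording you care about at the start.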

How to record a reference track

A good reference track has phonetic diversity (ideally covering all the sounds of the language) and lively intonation. A text of about 35–40 words, read calmly, takes roughly 15 seconds.
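
The figure of 35–40 words in about 15 seconds corresponds to a speaking rate of 140–160 words per minute (35 / 15 × 60 = 140 and 40 / 15 × 60 = 160). If you write your own script, a rough estimate like the one below can tell you whether it fits. This is only a sketch: the function name is hypothetical, the default of 150 words per minute is the midpoint of that range, and counting words by splitting on whitespace is an approximation. Your own reading speed may differ.

```python
def estimate_seconds(script: str, words_per_minute: float = 150.0) -> float:
    """Estimate how long a script takes to read aloud, in seconds."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    word_count = len(script.split())  # whitespace-separated words
    return word_count * 60.0 / words_per_minute


if __name__ == "__main__":
    sample = "Replace this line with the script you plan to read."
    print(f"about {estimate_seconds(sample):.1f} seconds")
```

If the estimate is well above 15 seconds, shorten the script, because the end of the recording would be trimmed.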

Here are three options in English for different tasks:

Option 1: Universal (Balanced & Clear)

The best choice for general use. It contains complex sound combinations that exercise clear articulation.

"To create a perfect voice clone, the AI needs to hear a full range of phonetic sounds. I am speaking clearly, taking small pauses, and asking: can you hear every detail? This short sample captures the unique texture and tone of my voice."

Option 2: Conversational (Vlog & Social Media)

For voiceovers in videos, YouTube, or blogs. Read vividly, with a smile, changing the pitch of your voice.

"Hey! I’m recording this clip to test how well the new technology works. The secret is to relax and speak exactly like I would to a friend. Do you think the AI can really copy my style and energy in just fifteen seconds?"

Option 3: Professional (Business & Narration)

For presentations, audiobooks, or official announcements. Read confidently, slightly slower, emphasizing word endings.

"Voice synthesis technology is rapidly changing how we communicate in the digital age. It is essential to speak with confidence and precision to ensure high-quality output. This brief recording provides all the necessary data for a professional and accurate digital clone."

Tips for recording:

Pronunciation: Try to articulate word endings clearly (especially t, d, s, ing). Models "love" clear articulation.
Flow: Don't read like a robot. In English, intonation matters: the voice should "float" up and down a bit rather than staying on a single note.
Breathing: If you pause at a comma or period, don't be afraid to take an audible breath. This will add realism to the clone.
