ZipVoice: Zero-shot Vietnamese Text-to-Speech Synthesis using Flow Matching with only 123M parameters.
The model was trained on approximately 2,500 hours of data on an RTX 3090 GPU.
Enter text and upload a sample voice to generate natural speech.
Sample Voice
Text
Speed (0.3 to 2)
Generate Voice
Generated Audio
Spectrogram
Model Limitations
1. The model may not perform well on digits, dates, special characters, and similar input.
2. The rhythm of some generated audio may be inconsistent or choppy.
3. By default, the reference audio is transcribed with the pho-whisper-medium model, which may not always recognize Vietnamese accurately, resulting in poor voice synthesis quality.
4. Inference on overly long paragraphs may produce poor results.
5. This demo uses a for loop to generate audio for each sentence of a long paragraph sequentially, so generation may be slow.
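The sentence-by-sentence loop mentioned in the last limitation might look roughly like the sketch below. It is an illustration only, not the demo's actual code: `split_sentences`, `synthesize_paragraph`, and `fake_synthesize` are hypothetical names, the punctuation-based splitting rule is an assumption, and the real model call is replaced by a placeholder. The speed value is clamped to the 0.3 to 2 range shown by the slider.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Split a paragraph after sentence-ending punctuation (assumed rule)."""
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    return [p for p in parts if p]


def synthesize_paragraph(
    text: str,
    synthesize: Callable[[str, float], List[float]],
    speed: float = 1.0,
) -> List[float]:
    """Generate audio one sentence at a time and concatenate the samples."""
    # Clamp speed to the range exposed by the demo's slider.
    speed = min(max(speed, 0.3), 2.0)
    samples: List[float] = []
    # Sentences are processed sequentially, so total time grows
    # with the number of sentences in the paragraph.
    for sentence in split_sentences(text):
        samples.extend(synthesize(sentence, speed))
    return samples


def fake_synthesize(sentence: str, speed: float) -> List[float]:
    """Hypothetical stand-in for the model: one silent sample per character."""
    return [0.0] * len(sentence)


if __name__ == "__main__":
    audio = synthesize_paragraph("Xin chào. Bạn khỏe không?", fake_synthesize, speed=1.2)
    print(len(audio))
```

Because each sentence waits for the previous one to finish, batching sentences or running them concurrently would be the natural way to reduce latency, at the cost of a more complex demo.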