Voice Cloning
Voice cloning is the process of creating a custom voice profile from audio samples using Mirako's deep voice cloning. It takes the following as input:
- Audio Samples: Clean, high-quality recordings of the target voice.
- Annotations: Text transcriptions of the audio samples, including language tags.
Voice cloning produces a new custom voice profile that your avatar can use for text-to-speech synthesis and interactive sessions.
Supported Languages
Mirako supports voice cloning for:
- Cantonese (yue)
- Mandarin Chinese (zh)
- English (en)
Scripts
You need a set of scripts containing the spoken content for your audio samples. We provide sample scripts for different languages that you can use when recording your audio samples. You can also create and use your own scripts.
The repository contains the following sample scripts.
Script File | Language | Description |
---|---|---|
english_generic_100.txt | English | Generic English scripts (~5 minutes) |
english_smalltalk_140.txt | English | English casual chatting scripts (~5 minutes) |
cantonese_40.txt | Cantonese | Cantonese sentences (~3 minutes) |
cantonese_english_mix_100.txt | Cantonese | Cantonese / English mixed sentences (~3 minutes) |
cantonese_word_100.txt | Cantonese | Cantonese words (~5 minutes) |
mandarin_150_v2.txt | Mandarin | Mandarin Chinese scripts (~5 minutes) |
You can use these scripts as-is, in any combination. For example, if you want the target voice to speak both Cantonese and English, you can use cantonese_english_mix_100.txt and english_generic_100.txt together.
Audio Requirements
The audio samples you provide should meet the following specifications:
- Formats: WAV, MP3
- Channels: Mono or stereo
- Sample Rate: from 16kHz to 48kHz
- Total duration: 5-15 minutes of clean speech
- Total file size: maximum 30MB
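The specifications above can be checked locally before uploading. The following is a minimal sketch, not part of Mirako's tooling: the function names are hypothetical, it only inspects WAV files (Python's standard library cannot read MP3 headers), and it assumes "30MB" means 30 × 1024 × 1024 bytes.

```python
import wave
from pathlib import Path

MIN_RATE_HZ = 16_000
MAX_RATE_HZ = 48_000
MIN_TOTAL_SECONDS = 5 * 60
MAX_TOTAL_SECONDS = 15 * 60
MAX_TOTAL_BYTES = 30 * 1024 * 1024  # assumption: "30MB" read as 30 MiB


def wav_info(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        return wf.getnchannels(), rate, wf.getnframes() / rate


def check_wav_samples(paths):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    total_seconds = 0.0
    total_bytes = 0
    for path in map(Path, paths):
        channels, rate, seconds = wav_info(path)
        if channels not in (1, 2):
            problems.append(f"{path.name}: {channels} channels (expected mono or stereo)")
        if not MIN_RATE_HZ <= rate <= MAX_RATE_HZ:
            problems.append(f"{path.name}: sample rate {rate} Hz is outside 16-48 kHz")
        total_seconds += seconds
        total_bytes += path.stat().st_size
    if not MIN_TOTAL_SECONDS <= total_seconds <= MAX_TOTAL_SECONDS:
        problems.append(f"total duration {total_seconds:.0f}s is outside 5-15 minutes")
    if total_bytes > MAX_TOTAL_BYTES:
        problems.append(f"total size {total_bytes} bytes exceeds the 30MB limit")
    return problems
```

Calling `check_wav_samples(Path("samples").glob("*.wav"))` on a hypothetical `samples` folder returns one message per violated requirement, so an empty result means the set is within the stated limits.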
Number of Samples
You should prepare at least 6 audio samples, each between 2 and 15 seconds long. We suggest around 20-30 samples per language. For example, if you want to clone a voice that speaks both Cantonese and English, the best practice is to prepare 20-30 audio samples each for Cantonese and English (40-60 samples in total).
To achieve the best results, you may increase the number of samples to 50-100 per language. Note that while more samples generally lead to more stable results, the key is to have clean, high-quality samples that cover the right expressions and tones.
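The counting rules above can be expressed as a small check. This is an illustrative sketch only: `review_sample_set` is a hypothetical helper, it treats the minimum of 6 as a total across all languages (the text does not say otherwise), and it reports the 20-sample suggestion as a warning rather than an error.

```python
from collections import Counter

MIN_SAMPLES = 6              # hard minimum, assumed to apply to the whole set
MIN_CLIP_SECONDS = 2.0
MAX_CLIP_SECONDS = 15.0
SUGGESTED_PER_LANGUAGE = 20  # lower end of the suggested 20-30 range


def review_sample_set(samples):
    """Check (filename, language_tag, duration_seconds) tuples.

    Returns (errors, warnings) as two lists of strings.
    """
    samples = list(samples)
    errors, warnings = [], []
    if len(samples) < MIN_SAMPLES:
        errors.append(f"only {len(samples)} samples; at least {MIN_SAMPLES} are required")
    for name, _lang, seconds in samples:
        if not MIN_CLIP_SECONDS <= seconds <= MAX_CLIP_SECONDS:
            errors.append(f"{name}: {seconds:.1f}s is outside the 2-15 second range")
    per_language = Counter(lang for _name, lang, _seconds in samples)
    for lang, count in sorted(per_language.items()):
        if count < SUGGESTED_PER_LANGUAGE:
            warnings.append(f"{lang}: {count} samples; 20-30 are suggested")
    return errors, warnings
```

Keeping errors and warnings separate mirrors the distinction in the text between what is required and what is merely recommended.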
Quality Guidelines
Voice cloning is a machine learning process that requires high-quality audio samples. The better the quality of your audio, the better the resulting voice output will be. Here are some best practices:
- Clear speech: Record in a quiet environment with no background noise, music, or reverb
- Consistent speaker: All samples from the same person
- Natural delivery: Conversational, not robotic
- Good audio quality: Use a professional microphone for recording. No clipping, distortion, or artifacts
- Varied content: Different sentences and expressions
Note: When preparing multilingual samples, the scripts used for different languages do not need to have the same meaning (for example, they need not be translations of one another). You can use different sentences for each language, as long as they are natural and cover a variety of expressions.
Annotation File
The annotation file is a text file that describes your audio samples. It tells the voice cloning service what each audio file contains: the language and the transcription of the spoken content.
Format
Each audio file requires a corresponding annotation entry. Create a text file named annotation.txt with this format:
audio_filename|language_tag|transcription
Language Tags
- yue: Cantonese
- zh: Mandarin Chinese
- en: English
Example Annotation File
sample_001.wav|en|Hi there, how are you doing today?
sample_002.wav|en|The weather is absolutely beautiful this morning.
sample_003.wav|yue|你知唔知邊度有間好食嘅茶餐廳?
sample_004.wav|yue|我想試吓佢哋嘅招牌菜。
sample_005.wav|zh|我今天早上去市場買了一些水果和蔬菜。
sample_006.wav|zh|這個地方的風景真的很美麗。
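An annotation file in this format can be parsed and sanity-checked with a few lines of Python. The sketch below is illustrative rather than Mirako's own parser: `parse_annotations` is a hypothetical name, and skipping blank lines and splitting only on the first two `|` characters (so a transcription may itself contain `|`) are assumptions about how the file should be read.

```python
VALID_TAGS = {"yue", "zh", "en"}


def parse_annotations(text):
    """Parse annotation text into (filename, language_tag, transcription) tuples.

    Raises ValueError, naming the offending line, on a malformed entry.
    """
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        parts = line.split("|", 2)  # split on the first two separators only
        if len(parts) != 3:
            raise ValueError(f"line {line_no}: expected filename|language_tag|transcription")
        filename, tag, transcription = (part.strip() for part in parts)
        if not filename or not transcription:
            raise ValueError(f"line {line_no}: empty filename or transcription")
        if tag not in VALID_TAGS:
            raise ValueError(f"line {line_no}: unknown language tag {tag!r}")
        entries.append((filename, tag, transcription))
    return entries
```

For a file on disk, `parse_annotations(Path("annotation.txt").read_text(encoding="utf-8"))` would be a typical call; reading as UTF-8 matters because the Cantonese and Mandarin transcriptions are not ASCII.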
Best Practices for Annotations
- Accurate Transcriptions: Ensure the text matches exactly what is spoken in the audio.
- Consistent Language Tags: Use the correct language tag for each entry.
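Whether a transcription is accurate can only be judged by listening, but the rule that every audio file has a corresponding annotation entry can be checked mechanically. The following sketch assumes the audio files sit in a single folder; `cross_check` and its result keys are hypothetical names chosen for illustration.

```python
from pathlib import Path

AUDIO_SUFFIXES = {".wav", ".mp3"}


def cross_check(audio_dir, annotated_filenames):
    """Compare audio files in a folder against the filenames listed in the annotation file."""
    on_disk = {
        p.name for p in Path(audio_dir).iterdir()
        if p.suffix.lower() in AUDIO_SUFFIXES
    }
    seen, duplicates = set(), set()
    for name in annotated_filenames:
        if name in seen:
            duplicates.add(name)  # the same file is annotated more than once
        seen.add(name)
    return {
        "missing_audio": sorted(seen - on_disk),       # annotated, but no such file
        "missing_annotation": sorted(on_disk - seen),  # file present, but not annotated
        "duplicates": sorted(duplicates),
    }
```

All three lists being empty indicates a one-to-one match between audio files and annotation entries.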