Voice Cloning

Voice cloning is the process of creating a custom voice profile from audio samples using Mirako's deep voice cloning. It takes two inputs:

Audio Samples: Clean, high-quality recordings of the target voice.
Annotations: Text transcriptions of the audio samples, including language tags.

The result of voice cloning is a new custom voice profile that your avatar can use for text-to-speech synthesis and interactive sessions.

Supported Languages

Mirako supports voice cloning for:

Cantonese (yue)
Mandarin Chinese (zh)
English (en)

Scripts

You need a set of scripts that contain the spoken content for your audio samples. We provide sample scripts for different languages that you can use when recording your audio samples. You can also create and use your own scripts.

The repository contains the following sample scripts.

Script File	Language	Description
`english_generic_100.txt`	English	Generic English scripts (~5 minutes)
`english_smalltalk_140.txt`	English	English casual chatting scripts (~5 minutes)
`cantonese_40.txt`	Cantonese	Cantonese sentences (~3 minutes)
`cantonese_english_mix_100.txt`	Cantonese	Cantonese / English mixed sentences (~3 minutes)
`cantonese_word_100.txt`	Cantonese	Cantonese words (~5 minutes)
`mandarin_150_v2.txt`	Mandarin	Mandarin Chinese scripts (~5 minutes)

You can use these scripts as-is and combine them as needed. For example, if you want the target voice to speak both Cantonese and English, you can use `cantonese_english_mix_100.txt` and `english_generic_100.txt` together.

Audio Requirements

The audio samples you provide should meet the following specifications:

Formats: WAV, MP3
Channels: Mono or stereo
Sample rate: 16 kHz to 48 kHz
Total duration: 5-15 minutes of clean speech
Total file size: maximum 30MB
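
As a rough pre-flight check, the specifications above can be verified locally before you submit your samples. The sketch below is not part of Mirako's tooling: the function name `check_wav_samples` is hypothetical, it handles WAV files only (Python's standard `wave` module reads uncompressed PCM WAV and cannot decode MP3), and it assumes "30MB" means 30 × 1024 × 1024 bytes.

```python
import os
import wave

# Limits taken from the specification list above.
MIN_RATE_HZ, MAX_RATE_HZ = 16_000, 48_000
MIN_TOTAL_SEC, MAX_TOTAL_SEC = 5 * 60, 15 * 60
MAX_TOTAL_BYTES = 30 * 1024 * 1024  # assumption: "30MB" is read as 30 MiB


def check_wav_samples(paths):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    total_seconds = 0.0
    total_bytes = 0
    for path in paths:
        total_bytes += os.path.getsize(path)
        # wave.open raises wave.Error for files that are not PCM WAV.
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            channels = wav.getnchannels()
            total_seconds += wav.getnframes() / rate
        if not MIN_RATE_HZ <= rate <= MAX_RATE_HZ:
            problems.append(f"{path}: sample rate {rate} Hz is outside 16-48 kHz")
        if channels not in (1, 2):
            problems.append(f"{path}: {channels} channels (expected mono or stereo)")
    if not MIN_TOTAL_SEC <= total_seconds <= MAX_TOTAL_SEC:
        problems.append(f"total duration {total_seconds:.0f}s is outside 5-15 minutes")
    if total_bytes > MAX_TOTAL_BYTES:
        problems.append(f"total size {total_bytes} bytes exceeds 30MB")
    return problems
```

Calling `check_wav_samples(["sample_001.wav", "sample_002.wav"])` returns an empty list when every check passes. MP3 files would need a third-party decoder, which is left out here.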

Number of Samples

Prepare at least 6 audio samples, each between 2 and 15 seconds long. We suggest around 20-30 samples for each language. For example, if you want to clone a voice that speaks both Cantonese and English, the best practice is to prepare 20-30 audio samples for Cantonese and another 20-30 for English (40-60 samples in total).

To achieve the best results, you may increase the number of samples to 50-100 for each language. More samples generally lead to more stable results, but what matters most is having clean, high-quality samples that cover the right expressions and tones.
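
The counts and lengths above can be reviewed with a few lines of Python before recording more material. This is a minimal sketch with hypothetical names (`review_sample_plan`, `durations_by_language`). It assumes the minimum of 6 samples applies to the whole set and not to each language, which the text above does not state explicitly, and it treats the 20-30 figure as a suggestion, not a hard requirement.

```python
MIN_SAMPLES = 6               # assumption: minimum applies to the whole sample set
MIN_SEC, MAX_SEC = 2.0, 15.0  # allowed length of a single sample
SUGGESTED_PER_LANGUAGE = 20   # lower end of the suggested 20-30 range


def review_sample_plan(durations_by_language):
    """Check a mapping of language tag -> list of clip lengths in seconds.

    Returns (errors, hints): errors break a stated requirement,
    hints only fall short of the suggested sample count.
    """
    errors, hints = [], []
    total = sum(len(d) for d in durations_by_language.values())
    if total < MIN_SAMPLES:
        errors.append(f"only {total} samples; at least {MIN_SAMPLES} are required")
    for lang, durations in durations_by_language.items():
        for i, seconds in enumerate(durations, start=1):
            if not MIN_SEC <= seconds <= MAX_SEC:
                errors.append(f"{lang} sample {i}: {seconds}s is outside 2-15 seconds")
        if len(durations) < SUGGESTED_PER_LANGUAGE:
            hints.append(f"{lang}: {len(durations)} samples; 20-30 are suggested")
    return errors, hints
```

For example, `review_sample_plan({"en": [3.0] * 25, "yue": [1.0, 5.0]})` reports one error for the 1-second Cantonese clip and one hint that Cantonese has fewer samples than suggested.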

Quality Guidelines

Voice cloning is a machine learning process that requires high-quality audio samples. The better the quality of your audio, the better the resulting voice output will be. Here are some best practices:

Clear speech: Record in a quiet environment with no background noise, music, or reverb
Consistent speaker: All samples from the same person
Natural delivery: Conversational, not robotic
Good audio quality: Use a professional microphone for recording. No clipping, distortion, or artifacts
Varied content: Different sentences and expressions

Note: When preparing multilingual samples, the scripts used for different languages do not need to have the same meaning (i.e. they do not need to be translations of one another). You can use different sentences for each language, as long as they are natural and cover a variety of expressions.

Annotation File

The annotation file is a text file that describes your audio samples. It tells the voice cloning service what each audio file contains: the language and the transcription of the spoken content.

Format

Each audio file requires a corresponding annotation entry. Create a text file named annotation.txt with this format:

audio_filename|language_tag|transcription

Language Tags

yue - Cantonese
zh - Mandarin Chinese
en - English

Example Annotation File

sample_001.wav|en|Hi there, how are you doing today?
sample_002.wav|en|The weather is absolutely beautiful this morning.
sample_003.wav|yue|你知唔知邊度有間好食嘅茶餐廳？
sample_004.wav|yue|我想試吓佢哋嘅招牌菜。
sample_005.wav|zh|我今天早上去市場買了一些水果和蔬菜。
sample_006.wav|zh|這個地方的風景真的很美麗。
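
To catch formatting mistakes early, entries in this format can be parsed and checked with a short script. The sketch below is an illustration, not Mirako's parser: the function names are hypothetical, and it assumes the file is UTF-8 encoded, that blank lines can be skipped, and that any `|` after the second separator belongs to the transcription.

```python
VALID_TAGS = {"yue", "zh", "en"}  # language tags listed above


def parse_annotation_line(line, line_number=0):
    """Split one 'audio_filename|language_tag|transcription' entry into a tuple."""
    # maxsplit=2 keeps any later '|' characters inside the transcription (assumption).
    parts = line.rstrip("\n").split("|", 2)
    if len(parts) != 3:
        raise ValueError(f"line {line_number}: expected 3 '|'-separated fields")
    filename, tag, text = (part.strip() for part in parts)
    if tag not in VALID_TAGS:
        raise ValueError(f"line {line_number}: unknown language tag {tag!r}")
    if not filename or not text:
        raise ValueError(f"line {line_number}: empty filename or transcription")
    return filename, tag, text


def load_annotations(path="annotation.txt"):
    """Parse every non-blank line of the annotation file into (filename, tag, text)."""
    with open(path, encoding="utf-8") as handle:
        return [
            parse_annotation_line(line, number)
            for number, line in enumerate(handle, start=1)
            if line.strip()
        ]
```

`load_annotations()` raises a `ValueError` naming the first offending line, so a malformed entry is reported before the file is used.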

Best Practices for Annotations

Accurate Transcriptions: Ensure the text matches exactly what is spoken in the audio.
Consistent Language Tags: Use the correct language tag for each entry.
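
Because each audio file requires a corresponding annotation entry, it can also help to compare the filenames in your annotation file against the files on disk. The following is a minimal sketch under stated assumptions: `compare_files_to_annotations` is a hypothetical helper, all samples are assumed to sit in a single directory, and only `.wav` and `.mp3` files are treated as samples.

```python
from pathlib import Path

AUDIO_SUFFIXES = {".wav", ".mp3"}  # the two supported formats


def compare_files_to_annotations(audio_dir, annotated_filenames):
    """Return (audio files with no entry, entries with no audio file), both sorted."""
    on_disk = {
        p.name
        for p in Path(audio_dir).iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    }
    annotated = set(annotated_filenames)
    return sorted(on_disk - annotated), sorted(annotated - on_disk)
```

Both returned lists should be empty before you proceed. A non-empty first list means some recordings have no transcription, and a non-empty second list usually points to a typo in a filename.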