Kokoro-82M / VOICES.md

Voices

Each voice's grades are estimates of the quality and quantity of its training data, both of which affect overall inference quality.

Subjectively, voices will sound better or worse to different people.

Support for non-English languages may be absent or thin due to weak G2P and/or lack of training data. Some languages are represented by only a handful of voices, or by just one (French).

Most voices perform best on a "goldilocks range" of 100-200 tokens out of ~500 possible. Voices may perform worse at the extremes:

  • Weakness on short utterances, especially those under 10-20 tokens. The root cause could be a lack of short-utterance training data and/or the model architecture. One possible inference-time mitigation is to bundle shorter utterances together.
  • Rushing on long utterances, especially over 400 tokens. You can chunk down to shorter utterances or adjust the speed parameter to mitigate this.
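
The bundling and chunking mitigations above can be sketched as a greedy packer. This is a minimal illustration, not the project's own chunking logic: `count_tokens` is a hypothetical stand-in that counts non-space characters, whereas the real token count depends on the model's G2P output, and `pack_utterances` is a name introduced here for the example.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Hypothetical stand-in: treat every non-space character as one token."""
    return sum(1 for ch in text if not ch.isspace())


def pack_utterances(
    sentences: List[str],
    max_tokens: int = 200,
    count: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Greedily bundle consecutive sentences into chunks of at most max_tokens.

    A single sentence longer than max_tokens is kept as its own chunk;
    splitting inside a sentence is out of scope for this sketch.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for sentence in sentences:
        n = count(sentence)
        # Close the current chunk if adding this sentence would overflow it.
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With the default limit of 200, short sentences are merged until the next one would push a chunk past the upper end of the 100-200 range, so very short inputs are not synthesized on their own when neighbors are available.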

Target Quality

  • How high quality is the reference voice? This grade may be impacted by audio quality, artifacts, compression, & sample rate.
  • How well do the text labels match the audio? Text/audio misalignment (e.g. from hallucinations) will lower this grade.

Training Duration

  • How much audio was seen during training? Smaller durations result in a lower overall grade.
  • 10 hours <= HH hours < 100 hours
  • 1 hour <= H hours < 10 hours
  • 10 minutes <= MM minutes < 100 minutes
  • 1 minute <= M minutes 🤏 < 10 minutes
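
As a rough illustration of the duration legend above, the buckets can be written as a small lookup. The function name `duration_bucket` is hypothetical. Note that the listed ranges overlap between 60 and 100 minutes (both "MM minutes" and "H hours" apply); this sketch assumes hours take over once a duration reaches 1 hour, which the legend does not state.

```python
def duration_bucket(minutes: float) -> str:
    """Map a training duration in minutes to the legend's bucket label."""
    if minutes < 1:
        raise ValueError("below the smallest bucket (1 minute)")
    if minutes < 10:
        return "M minutes 🤏"
    if minutes < 60:
        # Assumption: 60-100 minutes is reported as "H hours", not "MM minutes".
        return "MM minutes"
    hours = minutes / 60
    if hours < 10:
        return "H hours"
    if hours < 100:
        return "HH hours"
    raise ValueError("above the largest bucket (100 hours)")
```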

American English

  • lang_code='a' in misaki[en]
  • espeak-ng en-us fallback

| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| af_heart | 🚺❤️ | | | A | 0ab5709b |
| af_alloy | 🚺 | B | MM minutes | C | 6d877149 |
| af_aoede | 🚺 | B | H hours | C+ | c03bd1a4 |
| af_bella | 🚺🔥 | A | HH hours | A- | 8cb64e02 |
| af_jessica | 🚺 | C | MM minutes | D | cdfdccb8 |
| af_kore | 🚺 | B | H hours | C+ | 8bfbc512 |
| af_nicole | 🚺🎧 | B | HH hours | B- | c5561808 |
| af_nova | 🚺 | B | MM minutes | C | e0233676 |
| af_river | 🚺 | C | MM minutes | D | e149459b |
| af_sarah | 🚺 | B | H hours | C+ | 49bd364e |
| af_sky | 🚺 | B | M minutes 🤏 | C- | c799548a |
| am_adam | 🚹 | D | H hours | F+ | ced7e284 |
| am_echo | 🚹 | C | MM minutes | D | 8bcfdc85 |
| am_eric | 🚹 | C | MM minutes | D | ada66f0e |
| am_fenrir | 🚹 | B | H hours | C+ | 98e507ec |
| am_liam | 🚹 | C | MM minutes | D | c8255075 |
| am_michael | 🚹 | B | H hours | C+ | 9a443b79 |
| am_onyx | 🚹 | C | MM minutes | D | e8452be1 |
| am_puck | 🚹 | B | H hours | C+ | dd1d8973 |
| am_santa | 🚹 | C | M minutes 🤏 | D- | 7f2f7582 |
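
The voice names in these tables appear to follow a two-letter prefix pattern: the first letter matches the section's `lang_code` and the second is `f` or `m`, matching the 🚺/🚹 trait. That pattern is inferred from the tables rather than stated by the author, so the sketch below is only an illustration. `describe_voice` and both lookup tables are hypothetical helpers, and only the language codes that this document states explicitly are included.

```python
from typing import Dict, Optional

# Only lang_code values stated explicitly in this document.
LANGUAGE_BY_CODE: Dict[str, str] = {
    "a": "American English",
    "b": "British English",
    "j": "Japanese",
    "z": "Mandarin Chinese",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
}
TRAIT_BY_LETTER: Dict[str, str] = {"f": "🚺", "m": "🚹"}


def describe_voice(name: str) -> Dict[str, Optional[str]]:
    """Split a voice name such as 'af_heart' into its inferred parts."""
    prefix, _, speaker = name.partition("_")
    if len(prefix) != 2 or not speaker:
        raise ValueError(f"unexpected voice name: {name!r}")
    lang_code, gender = prefix[0], prefix[1]
    return {
        "lang_code": lang_code,
        # None when this document gives no lang_code for the language.
        "language": LANGUAGE_BY_CODE.get(lang_code),
        "trait": TRAIT_BY_LETTER.get(gender),
        "speaker": speaker,
    }
```

For example, `describe_voice("bm_george")` yields `lang_code` `"b"`, language `"British English"`, trait `"🚹"`, and speaker `"george"`.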

British English

  • lang_code='b' in misaki[en]
  • espeak-ng en-gb fallback

| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| bf_alice | 🚺 | C | MM minutes | D | d292651b |
| bf_emma | 🚺 | B | HH hours | B- | d0a423de |
| bf_isabella | 🚺 | B | MM minutes | C | cdd4c370 |
| bf_lily | 🚺 | C | MM minutes | D | 6e09c2e4 |
| bm_daniel | 🚹 | C | MM minutes | D | fc3fce4e |
| bm_fable | 🚹 | B | MM minutes | C | d44935f3 |
| bm_george | 🚹 | B | MM minutes | C | f1bc8122 |
| bm_lewis | 🚹 | C | H hours | D+ | b5204750 |

Japanese

  • lang_code='j' in misaki[ja]
  • Total Japanese training data: H hours

| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 | CC BY |
| --- | --- | --- | --- | --- | --- | --- |
| jf_alpha | 🚺 | B | H hours | C+ | 1bf4c9dc | |
| jf_gongitsune | 🚺 | B | MM minutes | C | 1b171917 | gongitsune |
| jf_nezumi | 🚺 | B | M minutes 🤏 | C- | d83f007a | nezuminoyomeiri |
| jf_tebukuro | 🚺 | B | MM minutes | C | 0d691790 | tebukurowokaini |
| jm_kumo | 🚹 | B | M minutes 🤏 | C- | 98340afd | kumonoito |

Mandarin Chinese

  • lang_code='z' in misaki[zh]
  • Total Mandarin Chinese training data: H hours

| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| zf_xiaobei | 🚺 | C | MM minutes | D | 9b76be63 |
| zf_xiaoni | 🚺 | C | MM minutes | D | 95b49f16 |
| zf_xiaoxiao | 🚺 | C | MM minutes | D | cfaf6f2d |
| zf_xiaoyi | 🚺 | C | MM minutes | D | b5235dba |
| zm_yunjian | 🚹 | C | MM minutes | D | 76cbf8ba |
| zm_yunxi | 🚹 | C | MM minutes | D | dbe6e1ce |
| zm_yunxia | 🚹 | C | MM minutes | D | bb2b03b0 |
| zm_yunyang | 🚹 | C | MM minutes | D | 5238ac22 |

Spanish

| Name | Traits | SHA256 |
| --- | --- | --- |
| ef_dora | 🚺 | d9d69b0f |
| em_alex | 🚹 | 5eac53f7 |
| em_santa | 🚹 | aa8620cb |

French

  • lang_code='f' in misaki[en]
  • espeak-ng fr-fr
  • Total French training data: <11 hours

| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 | CC BY |
| --- | --- | --- | --- | --- | --- | --- |
| ff_siwis | 🚺 | B | <11 hours | B- | 8073bf2d | SIWIS |

Hindi

  • lang_code='h' in misaki[en]
  • espeak-ng hi
  • Total Hindi training data: H hours
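
| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| hf_alpha | 🚺 | B | MM minutes | C | 06906fe0 |
| hf_beta | 🚺 | B | MM minutes | C | 63c0a1a6 |
| hm_omega | 🚹 | B | MM minutes | C | b55f02a8 |
| hm_psi | 🚹 | B | MM minutes | C | 2f0f055c |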

Italian

  • lang_code='i' in misaki[en]
  • espeak-ng it
  • Total Italian training data: H hours

| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| if_sara | 🚺 | B | MM minutes | C | 6c0b253b |
| im_nicola | 🚹 | B | MM minutes | C | 234ed066 |

Brazilian Portuguese

| Name | Traits | SHA256 |
| --- | --- | --- |
| pf_dora | 🚺 | 07e4ff98 |
| pm_alex | 🚹 | cf0ba8c5 |
| pm_santa | 🚹 | d4210316 |
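
The SHA256 column lists 8 hex characters per voice. Assuming these are the leading characters of the SHA256 digest of each downloaded voice file (the document does not say so explicitly), a downloaded file could be checked roughly as follows. The helper names and the example path are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Union


def sha256_prefix(path: Union[str, Path], length: int = 8) -> str:
    """Return the first `length` hex characters of a file's SHA256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB blocks so large files are not loaded into memory at once.
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()[:length]


def matches_listed_hash(path: Union[str, Path], listed: str) -> bool:
    """Compare a file's digest prefix with the value shown in a table row."""
    return sha256_prefix(path, len(listed)) == listed.lower()
```

A call such as `matches_listed_hash("voices/af_heart.pt", "0ab5709b")` would then report whether a local copy matches the table entry, with the path adjusted to wherever the file was actually saved. An 8-character prefix catches accidental corruption or a mixed-up file but is too short to serve as a security guarantee.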