hexgrad committed · Commit 21e7170 · 1 Parent(s): f0f6f4d

Upload 2 files

Files changed (2)
  1. README.md +7 -21
  2. VOICES.md +38 -24
README.md CHANGED
@@ -12,13 +12,13 @@ pipeline_tag: text-to-speech
 
 **Kokoro** is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.
 
- - [Releases](https://huggingface.co/hexgrad/Kokoro-82M#releases)
- - [Usage](https://huggingface.co/hexgrad/Kokoro-82M#usage)
- - [Voices and Languages](https://huggingface.co/hexgrad/Kokoro-82M#voices-and-languages)
- - [Model Facts](https://huggingface.co/hexgrad/Kokoro-82M#model-facts)
- - [Training Details](https://huggingface.co/hexgrad/Kokoro-82M#training-details)
- - [Creative Commons Attribution](https://huggingface.co/hexgrad/Kokoro-82M#creative-commons-attribution)
- - [Acknowledgements](https://huggingface.co/hexgrad/Kokoro-82M#acknowledgements)
+ - [Releases](#releases)
+ - [Usage](#usage)
+ - [VOICES.md](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md) ↗️
+ - [Model Facts](#model-facts)
+ - [Training Details](#training-details)
+ - [Creative Commons Attribution](#creative-commons-attribution)
+ - [Acknowledgements](#acknowledgements)
 
 ### Releases
 
@@ -79,20 +79,6 @@ for i, (gs, ps, audio) in enumerate(generator):
 
 Under the hood, `kokoro` uses [`misaki`](https://pypi.org/project/misaki/), a G2P library at https://github.com/hexgrad/misaki
 
- ### Voices and Languages
-
- Voices are listed in [VOICES.md](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md). Not all voices are created equal:
- - Subjectively, voices will sound better or worse to different people.
- - Less training data for a given voice (minutes instead of hours) => worse inference quality.
- - Poor audio quality in training data (compression, sample rate, artifacts) => worse inference quality.
- - Text-audio misalignment (too much text i.e. hallucinations, or not enough text i.e. failed transcriptions) => worse inference quality.
-
- Support for non-English languages may be absent or thin due to weak G2P and/or lack of training data. Some languages are only represented by a small handful or even just one voice (French).
-
- Most voices perform best on a "goldilocks range" of 100-200 tokens out of ~500 possible. Voices may perform worse at the extremes:
- - **Weakness** on short utterances, especially less than 10-20 tokens. Root cause could be lack of short-utterance training data and/or model architecture. One possible inference mitigation is to bundle shorter utterances together.
- - **Rushing** on long utterances, especially over 400 tokens. You can chunk down to shorter utterances or adjust the `speed` parameter to mitigate this.
-
 ### Model Facts
 
 **Architecture:**
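
For reference, the second hunk header above points at the `for i, (gs, ps, audio) in enumerate(generator):` loop from the Usage section, and the relocated prose recommends chunking and the `speed` parameter as mitigations for long utterances. A minimal sketch of that usage follows, assuming the `kokoro` and `soundfile` packages and using `am_puck` (a voice listed in VOICES.md); the 24 kHz output rate and the `split_pattern` argument come from the package documentation, not from this commit.

```python
# Hedged sketch of the generator loop referenced in the hunk header above.
# Long input is split into shorter chunks (the "goldilocks range" guidance)
# and `speed` is lowered slightly to counter rushing on long passages.
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' = American English (see VOICES.md)

text = (
    "Kokoro is an open-weight TTS model with 82 million parameters.\n"
    "Splitting on newlines keeps each utterance inside the comfortable token range.\n"
)

# split_pattern chunks the text before synthesis; speed < 1 slows delivery.
generator = pipeline(text, voice='am_puck', speed=0.9, split_pattern=r'\n+')

for i, (gs, ps, audio) in enumerate(generator):
    print(i, gs, ps)                    # graphemes (text chunk) and phonemes
    sf.write(f'{i}.wav', audio, 24000)  # Kokoro outputs 24 kHz audio
```
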
VOICES.md CHANGED
@@ -1,9 +1,23 @@
 # Voices
 
+ 🇺🇸 [American English](#american-english): 10F 9M
+ 🇬🇧 [British English](#british-english): 4F 4M
+ 🇫🇷 [French](#french): 1F
+ 🇮🇳 [Hindi](#hindi): 2F 2M
+ 🇮🇹 [Italian](#italian): 1F 1M
+ 🇯🇵 [Japanese](#japanese): 4F 1M
+ 🇨🇳 [Mandarin Chinese](#mandarin-chinese): 4F 4M
+
 For each voice, the given grades are intended to be estimates of the **quality and quantity** of its associated training data, both of which impact overall inference quality.
 
 Subjectively, voices will sound better or worse to different people.
 
+ Support for non-English languages may be absent or thin due to weak G2P and/or lack of training data. Some languages are only represented by a small handful or even just one voice (French).
+
+ Most voices perform best on a "goldilocks range" of 100-200 tokens out of ~500 possible. Voices may perform worse at the extremes:
+ - **Weakness** on short utterances, especially less than 10-20 tokens. Root cause could be lack of short-utterance training data and/or model architecture. One possible inference mitigation is to bundle shorter utterances together.
+ - **Rushing** on long utterances, especially over 400 tokens. You can chunk down to shorter utterances or adjust the `speed` parameter to mitigate this.
+
 **Target Quality**
 - How high quality is the reference voice? This grade may be impacted by audio quality, artifacts, compression, & sample rate.
 - How well do the text labels match the audio? Text/audio misalignment (e.g. from hallucinations) will lower this grade.
@@ -15,10 +29,10 @@ Subjectively, voices will sound better or worse to different people.
 - 10 minutes <= MM minutes < 100 minutes
 - 1 minute <= _M minutes_ < 10 minutes 🤏
 
- ### American English 🇺🇸
+ ### American English
 
- - `lang_code='a'` in [`misaki[en]`](https://github.com/hexgrad/misaki)
- - espeak-ng `en-us` fallback
+ 🇺🇸 `lang_code='a'` in [`misaki[en]`](https://github.com/hexgrad/misaki)
+ 🇺🇸 espeak-ng `en-us` fallback
 
 | Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
 | ---- | ------ | -------------- | ----------------- | ------------- | ------ |
@@ -42,10 +56,10 @@ Subjectively, voices will sound better or worse to different people.
 | am_puck | 🚹 | B | H hours | C+ | `dd1d8973` |
 | am_santa | 🚹🤏 | C | _M minutes_ | D- | `7f2f7582` |
 
- ### British English 🇬🇧
+ ### British English
 
- - `lang_code='b'` in [`misaki[en]`](https://github.com/hexgrad/misaki)
- - espeak-ng `en-gb` fallback
+ 🇬🇧 `lang_code='b'` in [`misaki[en]`](https://github.com/hexgrad/misaki)
+ 🇬🇧 espeak-ng `en-gb` fallback
 
 | Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
 | ---- | ------ | -------------- | ----------------- | ------------- | ------ |
@@ -58,21 +72,21 @@ Subjectively, voices will sound better or worse to different people.
 | bm_george | 🚹 | B | MM minutes | C | `f1bc8122` |
 | bm_lewis | 🚹 | C | H hours | D+ | `b5204750` |
 
- ### French 🇫🇷
+ ### French
 
- - `lang_code='f'` in [`misaki[en]`](https://github.com/hexgrad/misaki)
- - espeak-ng `fr-fr`
- - Total French training data: <11 hours
+ 🇫🇷 `lang_code='f'` in [`misaki[en]`](https://github.com/hexgrad/misaki)
+ 🇫🇷 espeak-ng `fr-fr`
+ 🇫🇷 Total French training data: <11 hours
 
 | Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 | CC BY |
 | ---- | ------ | -------------- | ----------------- | ------------- | ------ | ----- |
 | ff_siwis | 🚺 | B | <11 hours | B- | `8073bf2d` | [SIWIS](https://datashare.ed.ac.uk/handle/10283/2353) |
 
- ### Hindi 🇮🇳
+ ### Hindi
 
- - `lang_code='h'` in [`misaki[en]`](https://github.com/hexgrad/misaki)
- - espeak-ng `hi`
- - Total Hindi training data: H hours
+ 🇮🇳 `lang_code='h'` in [`misaki[en]`](https://github.com/hexgrad/misaki)
+ 🇮🇳 espeak-ng `hi`
+ 🇮🇳 Total Hindi training data: H hours
 
 | Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
 | ---- | ------ | -------------- | ----------------- | ------------- | ------ |
@@ -81,21 +95,21 @@ Subjectively, voices will sound better or worse to different people.
 | hm_omega | 🚹 | B | MM minutes | C | `b55f02a8` |
 | hm_psi | 🚹 | B | MM minutes | C | `2f0f055c` |
 
- ### Italian 🇮🇳
+ ### Italian
 
- - `lang_code='i'` in [`misaki[en]`](https://github.com/hexgrad/misaki)
- - espeak-ng `it`
- - Total Italian training data: H hours
+ 🇮🇹 `lang_code='i'` in [`misaki[en]`](https://github.com/hexgrad/misaki)
+ 🇮🇹 espeak-ng `it`
+ 🇮🇹 Total Italian training data: H hours
 
 | Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
 | ---- | ------ | -------------- | ----------------- | ------------- | ------ |
 | if_sara | 🚺 | B | MM minutes | C | `6c0b253b` |
 | im_nicola | 🚹 | B | MM minutes | C | `234ed066` |
 
- ### Japanese 🇯🇵
+ ### Japanese
 
- - `lang_code='j'` in [`misaki[ja]`](https://github.com/hexgrad/misaki)
- - Total Japanese training data: H hours
+ 🇯🇵 `lang_code='j'` in [`misaki[ja]`](https://github.com/hexgrad/misaki)
+ 🇯🇵 Total Japanese training data: H hours
 
 | Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 | CC BY |
 | ---- | ------ | -------------- | ----------------- | ------------- | ------ | ----- |
@@ -105,10 +119,10 @@ Subjectively, voices will sound better or worse to different people.
 | jf_tebukuro | 🚺 | B | MM minutes | C | `0d691790` | [tebukurowokaini](https://github.com/koniwa/koniwa/blob/master/source/tnc/tnc__tebukurowokaini.txt) |
 | jm_kumo | 🚹🤏 | B | _M minutes_ | C- | `98340afd` | [kumonoito](https://github.com/koniwa/koniwa/blob/master/source/tnc/tnc__kumonoito.txt) |
 
- ### Mandarin Chinese 🇨🇳
+ ### Mandarin Chinese
 
- - `lang_code='z'` in [`misaki[zh]`](https://github.com/hexgrad/misaki)
- - Total Mandarin Chinese training data: H hours
+ 🇨🇳 `lang_code='z'` in [`misaki[zh]`](https://github.com/hexgrad/misaki)
+ 🇨🇳 Total Mandarin Chinese training data: H hours
 
 | Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
 | ---- | ------ | -------------- | ----------------- | ------------- | ------ |
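
The `lang_code` bullets in the VOICES.md diff map directly onto the `KPipeline` constructor. Below is a hedged sketch of selecting a pipeline per language, using voice names from the tables in this diff; it assumes the relevant `misaki` extras (and `espeak-ng` for the English fallbacks) are installed, which the commit itself does not specify.

```python
# Hedged sketch: one KPipeline per lang_code listed above, each paired with a
# voice name taken from the corresponding table in this diff.
from kokoro import KPipeline

SAMPLES = [
    ('a', 'am_puck',   "Hello from an American English voice."),
    ('b', 'bm_george', "Hello from a British English voice."),
    ('f', 'ff_siwis',  "Bonjour, ceci est la voix française."),
    ('i', 'if_sara',   "Ciao, questa è la voce italiana."),
]

for lang_code, voice, text in SAMPLES:
    pipeline = KPipeline(lang_code=lang_code)  # G2P backend chosen by lang_code
    for _, _, audio in pipeline(text, voice=voice):
        print(lang_code, voice, len(audio))    # one waveform chunk per yield
```
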