* It is intended to be run with the provided code here and in the notebook; read the comments in gen_comments-1-2023-clean.py.

* That was, as far as I knew, the biggest GPT/Transformer model in Bulgarian at the time, except for one of unknown size that was demoed for a few seconds in a LinkedIn video: a display of generation in Bulgarian by a startup called BAIHUI AI in mid-2019. In my blog I wrote 1.5B, but I don't remember whether they actually mentioned a size, and now it seems unlikely; they didn't have the resources of OpenAI (three people, one ML engineer). https://artificial-mind.blogspot.com/2019/07/baihuiai-baihuiai-new-bulgarian-ai.html The company didn't live long; it was a show-off. In hindsight it seems reasonable that their model was GPT2-SMALL, as that was the usual choice for demos even years later; I couldn't get information from the project's ML engineer, M.V. I later found several other GPT2-SMALL models trained here: one for poetry, and one by the Bulgarian Academy of Sciences in 2023.

* The files also include a sample method for unlimited-length multi-step generation with hidden injections of tokens for directed topic change (though it needed more smoothing etc.). The methods are explained in videos on YouTube.
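A minimal sketch of what such a multi-step loop could look like. This is an illustration only, not the repo's exact method: `fake_generate` is a hypothetical stand-in for the real model call, and the exact injection scheme here (prepending hidden steering tokens to the context) is an assumption.

```python
# Sketch: unlimited-length generation by repeatedly re-feeding a sliding
# window of the output, with hidden steering tokens injected into the
# model's context at chosen steps but never emitted to the reader.

def fake_generate(context, n_new):
    # Placeholder for a real model call: a real implementation would
    # sample n_new tokens conditioned on `context`.
    return [f"tok{len(context) + i}" for i in range(n_new)]

def multi_step_generate(model_step, prompt_tokens, steps, step_len,
                        window, injections=None):
    """Generate in `steps` rounds of `step_len` tokens each.
    `injections` maps a step index to hidden steering tokens that are
    prepended to the model context but never appended to the output."""
    injections = injections or {}
    output = list(prompt_tokens)
    for step in range(steps):
        context = output[-window:]           # sliding window over output
        hidden = injections.get(step, [])    # topic-steering tokens
        new_tokens = model_step(hidden + context, step_len)
        output.extend(new_tokens)            # hidden tokens stay hidden
    return output

text = multi_step_generate(fake_generate, ["Начало"], steps=3,
                           step_len=4, window=8,
                           injections={1: ["<тема:спорт>"]})
```

The key design point is that the steering tokens influence the conditioning context without ever appearing in the emitted text, so the topic shift looks spontaneous to the reader.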
* The dataset was quite small, with a maximum of about 140 MB of UTF-8 text; it includes some words and passages in other languages (thus a bit more than 70M characters), but IMO the results were decent, subjectively for the size (no systematic formal test).

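The roughly 2:1 gap between bytes and characters follows from UTF-8 encoding Cyrillic letters in two bytes each, while ASCII punctuation and spaces take one. A quick check on a toy string (not the actual corpus):

```python
# Cyrillic letters occupy 2 bytes each in UTF-8, so a mostly-Bulgarian
# corpus of ~140 MB holds somewhat more than 70M characters (the 1-byte
# spaces and punctuation push the character count above bytes/2).
sample = "Здравей, свят!"            # "Hello, world!" in Bulgarian
n_chars = len(sample)                # 14 characters
n_bytes = len(sample.encode("utf-8"))  # 11 Cyrillic * 2 + 3 ASCII = 25
print(n_chars, n_bytes)
```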
## Dataset, preprocessing and training