twenkid committed on
Commit ea53d03 · verified · 1 parent: 5dfa40d

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -8,7 +8,7 @@
 * It is supposed to be run with the provided code here and in the notebook. Read the comments in gen_comments-1-2023-clean.py
 * That was, as far as I knew, the biggest GPT/Transformer model in Bulgarian at the time, except for one of unknown size, demoed for a few seconds in a video on LinkedIn: a display of generation in Bulgarian by a startup called BAIHUI AI in mid 2019. In my blog I wrote 1.5B, but I don't remember whether they mentioned a size, and now it seems unlikely: they didn't have the resources of OpenAI (three people, one ML engineer). https://artificial-mind.blogspot.com/2019/07/baihuiai-baihuiai-new-bulgarian-ai.html The company didn't live long; it was a show-off. Now it seems reasonable that their model was GPT2-SMALL, as that was the usual choice for demos even years later; I couldn't get info from the project's ML engineer, M.V. I found several other GPT2-SMALL models trained later here: one for poetry, one by the Bulgarian Academy of Sciences in 2023.
 * The files also include a sample method for unlimited-length multi-step generation with hidden injections of tokens for directed topic change (but it needed more smoothing etc.). The methods are explained in videos on YouTube.
-* The dataset was quite small, with a maximum of about 140 MB?/UTF, and includes some words and texts in other languages within the text (thus a bit more than 70M characters), but IMO the results were decent (subjectively, for the size; no systematic formal test).
+* The dataset was quite small, with a maximum of about 140 MB? UTF8, and includes some words and texts in other languages within the text (thus a bit more than 70M characters), but IMO the results were decent (subjectively, for the size; no systematic formal test).
 
 
 ## Dataset, preprocessing and training
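
The "multi-step generation with hidden injections of tokens" mentioned in the diff can be sketched as a simple loop: generate in rounds, and before a chosen round append steering text to the model's context without ever adding it to the visible output. This is a hypothetical illustration, not the repository's actual implementation (that is in gen_comments-1-2023-clean.py); `toy_model` stands in for a real GPT-2 generate call, and the character-based context window is a simplification of token-based truncation.

```python
def multi_step_generate(generate_fn, prompt, steps, context_chars=512,
                        hidden_injections=None):
    """Run `steps` generation rounds. Before each round, steering text from
    `hidden_injections` (a dict: step index -> text) is appended to the
    model's context but is never added to the visible output."""
    visible = prompt
    for i in range(steps):
        injection = (hidden_injections or {}).get(i, "")
        # The model sees the tail of the visible text plus the hidden
        # injection; only its continuation joins the output.
        context = visible[-context_chars:] + injection
        visible += generate_fn(context)
    return visible

# Toy stand-in for a GPT-2 call: reports the length of the context it saw,
# which lets us observe the injection widening the context.
def toy_model(context):
    return f"[{len(context)}]"

out = multi_step_generate(toy_model, "start", steps=2,
                          hidden_injections={1: " TOPIC"})
# → "start[5][14]": step 1's context grew by the 6 hidden characters,
# but " TOPIC" itself never appears in the output.
```

With a real model, `generate_fn` would tokenize the context, call `model.generate`, and decode only the newly generated tokens, so the loop can run indefinitely while the injections quietly redirect the topic.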