twenkid committed on
Commit ea53d03 · verified · 1 parent: 5dfa40d

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -8,7 +8,7 @@
 * It is supposed to be run with the provided code here and in the notebook. Read the comments in gen_comments-1-2023-clean.py
 * That was, as far as I knew, the biggest GPT/Transformer model in Bulgarian at the time, except for one of unknown size, demoed for a few seconds in a video on LinkedIn: a display of generation in Bulgarian by a startup called BAIHUI AI in mid 2019. In my blog I wrote 1.5B, but I don't remember whether they mentioned a size, and now it seems unlikely: they didn't have the resources of OpenAI (three people, one ML engineer). https://artificial-mind.blogspot.com/2019/07/baihuiai-baihuiai-new-bulgarian-ai.html The company didn't live long; it was a show-off. Now it seems reasonable that their model was GPT2-SMALL, as that was the usual choice for demos even years later; I couldn't get info from the project's ML engineer, M.V. I found several other GPT2-SMALL models trained later here: one for poetry, one by the Bulgarian Academy of Sciences in 2023.
 * The files also include a sample method for unlimited-length multi-step generation with hidden injections of tokens for directed topic change (but it needed more smoothing etc.). The methods are explained in videos on YouTube.
-* The dataset was quite small, with a maximum of about 140 MB?/UTF, and includes some words and texts in other languages within the text (thus a bit more than 70M characters), but IMO the results were decent (subjectively, for the size; no systematic formal test).
+* The dataset was quite small, with a maximum of about 140 MB? UTF8, and includes some words and texts in other languages within the text (thus a bit more than 70M characters), but IMO the results were decent (subjectively, for the size; no systematic formal test).
 
 
 ## Dataset, preprocessing and training
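
The "multi-step generation with hidden injections of tokens" mentioned in the diff can be sketched as a simple loop: generate in rounds, and before a chosen round append steering text to the model's context without ever adding it to the visible output. This is a hypothetical illustration, not the repository's actual implementation (that is in gen_comments-1-2023-clean.py); `toy_model` stands in for a real GPT-2 generate call, and the character-based context window is a simplification of token-based truncation.

```python
def multi_step_generate(generate_fn, prompt, steps, context_chars=512,
                        hidden_injections=None):
    """Run `steps` generation rounds. Before each round, steering text from
    `hidden_injections` (a dict: step index -> text) is appended to the
    model's context but is never added to the visible output."""
    visible = prompt
    for i in range(steps):
        injection = (hidden_injections or {}).get(i, "")
        # The model sees the tail of the visible text plus the hidden
        # injection; only its continuation joins the output.
        context = visible[-context_chars:] + injection
        visible += generate_fn(context)
    return visible

# Toy stand-in for a GPT-2 call: reports the length of the context it saw,
# which lets us observe the injection widening the context.
def toy_model(context):
    return f"[{len(context)}]"

out = multi_step_generate(toy_model, "start", steps=2,
                          hidden_injections={1: " TOPIC"})
# → "start[5][14]": step 1's context grew by the 6 hidden characters,
# but " TOPIC" itself never appears in the output.
```

With a real model, `generate_fn` would tokenize the context, call `model.generate`, and decode only the newly generated tokens, so the loop can run indefinitely while the injections quietly redirect the topic.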