twenkid committed on
Commit 8fc1084 · verified · 1 Parent(s): 95e0f49

Update README.md

Files changed (1)
  1. README.md +4 -3
README.md CHANGED
@@ -7,8 +7,8 @@
  * The model was created and trained from scratch, using TensorFlow on a free Google Colab T4. The research experiment started in June 2021 and continued until September; the video explanation was uploaded on 17.9.2024.
  * It is supposed to be run with the provided code here and in the notebook; read the comments in gen_comments-1-2023-clean.py.
  * As far as I knew, this was the biggest GPT/Transformer model in Bulgarian at the time, except for one of unknown size that was demoed for a few seconds in a LinkedIn video: a display of generation in Bulgarian by a startup called BAIHUI AI in mid-2019. I wrote 1.5B in my blog, but I don't remember whether they actually mentioned a size, and now it seems unlikely: they didn't have the resources of OpenAI (three people, one ML engineer). https://artificial-mind.blogspot.com/2019/07/baihuiai-baihuiai-new-bulgarian-ai.html The company didn't live long; it was a show-off. It now seems reasonable that their model was GPT2-SMALL, as that was the usual choice for demos even years later; I couldn't get information from the project's ML engineer, M.V. I found several other GPT2-SMALL models trained later here, one for poetry and one by the Bulgarian Academy of Sciences in 2023.
- * The files include also a sample method for unlimited length multi-step generation with hidden injections of tokens for directed topic change (but it needed more smoothing etc.). The methods are explained in videos on Youtube.
- * The dataset was quite small with a maximum of about 140 MB? UTF8, includes some words and texts in other languages within the text (thus a bit more than 70M characters), but IMO the results were decent (subjectively for the size, no systematic formal test).
+ * A method for unlimited-length, multi-step generation with hidden injections of tokens for directed topic change is also included (though it needed more smoothing etc.); a rough sketch is given after the diff. The methods are explained in videos on YouTube.
+ * The dataset was quite small, with a maximum of about 140 MB (?) of UTF-8; it includes some words and texts in other languages within the text (thus more than 70M characters, since Cyrillic takes two bytes per character in UTF-8), but IMO the results were decent (subjectively, for the size; no systematic formal test).


  ## Dataset, preprocessing and training
@@ -16,7 +16,8 @@
  * Various selected books from "Chitanka", with some cleaning of the markup in the books: notes and footnotes ([34]... etc.), ids, links to the books, etc. (a cleaning sketch is given below).
  * Various works, books, publications and texts written by the author himself; the biggest file among them was the current draft of a big survey book on AGI & Transhumanism, along with several other interdisciplinary books. Many articles and works from the e-zine "The Sacred Computer" (Свещеният сметач).
  * Some poetry by Hristo Botev
- * A few articles about computers from forums and web pages, a bit in Bulgarian, some machine translated from English to Bulgarian
+ * A few articles about computers from forums and web pages, some in Bulgarian and some machine-translated from English to Bulgarian.
+ * Some articles from magazines on political topics.
  * During training, the dataset and its sampling were incrementally updated or changed after observing the generations, when I recognized the source of the style of the outputs: some books seemed "poisonous" with their patterns and were reduced or removed, e.g. I. Asimov's extensive repetition of the characters' names.
  * Some items were added and others removed; some smaller documents were consumed multiple times, while from items that were too big a shorter random section was selected, a different section in each iteration, etc. (a sampling sketch is given below).
  * Due to the usage of a free Colab notebook with a limited and unpredictable range of uninterruptible hours, maybe up to 3 hours or so, sometimes less, occasionally a few hours more, with an unknown end, it was impossible to perform a complete iteration over the entire dataset in one run (too big a dataset may also fail to fit in RAM at once); see the checkpointing sketch below.
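The multi-step generation with hidden token injections mentioned above is implemented in gen_comments-1-2023-clean.py and explained in the videos; the following is only a minimal sketch of the general idea, with the trained model replaced by a stub and the names (sample_next_token, MAX_CONTEXT, STEER_EVERY) invented for illustration: generation proceeds in chunks over a sliding window, and the steering tokens are fed to the model but never appended to the visible output.

```python
# Minimal sketch of multi-step generation with hidden "steering" tokens.
# The real method lives in gen_comments-1-2023-clean.py; here the model is
# stubbed out and all constants are illustrative.
import random

MAX_CONTEXT = 256          # assumed context window of the model, in tokens
STEER_EVERY = 5            # inject the hidden topic tokens every N steps


def sample_next_token(context_tokens):
    """Stand-in for the trained model: returns one next token id."""
    return random.randint(0, 999)


def generate_unlimited(prompt_tokens, topic_tokens, n_steps, tokens_per_step=40):
    visible = list(prompt_tokens)                 # what the user eventually sees
    for step in range(n_steps):
        if step % STEER_EVERY == 0:
            # Hidden injection: prepend the topic tokens to the context the
            # model conditions on, but never add them to the visible output.
            context = list(topic_tokens) + visible[-(MAX_CONTEXT - len(topic_tokens)):]
        else:
            context = visible[-MAX_CONTEXT:]
        for _ in range(tokens_per_step):
            nxt = sample_next_token(context)
            visible.append(nxt)
            context = (context + [nxt])[-MAX_CONTEXT:]
    return visible


if __name__ == "__main__":
    out = generate_unlimited(prompt_tokens=[1, 2, 3], topic_tokens=[777, 778], n_steps=12)
    print(len(out), "tokens generated")
```

The "smoothing" the README mentions would presumably act at the seams between chunks and around the injected tokens; that part is not reproduced here.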
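For the "Chitanka" cleaning mentioned in the dataset list, the exact preprocessing is not documented in this section, so the sketch below only illustrates the kind of regex pass that strips footnote markers like [34], bare links and id lines; the patterns are assumptions, not the author's actual rules.

```python
# A rough cleaning sketch for book texts, assuming Chitanka-style footnote
# markers like [34], bare URLs and "id:"/"id=" lines; the real preprocessing
# may differ.
import re


def clean_book_text(text: str) -> str:
    text = re.sub(r"\[\d+\]", "", text)              # footnote markers: [34]
    text = re.sub(r"https?://\S+", "", text)         # links to/inside the books
    text = re.sub(r"(?im)^\s*id\s*[:=].*$", "", text)  # id lines
    text = re.sub(r"[ \t]{2,}", " ", text)           # collapse runs of spaces
    text = re.sub(r"\n{3,}", "\n\n", text)           # collapse runs of blank lines
    return text.strip()


if __name__ == "__main__":
    sample = "Text with a note[34] and a link http://example.com\n\n\n\nNext paragraph."
    print(clean_book_text(sample))
```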
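The sampling scheme described above (small documents taken whole, possibly repeatedly; a different random slice from oversized documents in each iteration) could look roughly like the sketch below; the thresholds, repeat counts and function names are made up for illustration.

```python
# Sketch of the described sampling: small documents are consumed whole
# (possibly several times), big documents contribute a fresh random slice
# in each iteration. All limits are illustrative, not the original settings.
import random

SMALL_LIMIT = 200_000      # characters; files below this are taken whole
SLICE_LEN = 200_000        # length of the random slice taken from big files


def build_iteration(documents, small_repeats=2):
    """documents: dict of name -> full text; returns one training chunk."""
    parts = []
    for name, text in documents.items():
        if len(text) <= SMALL_LIMIT:
            parts.extend([text] * small_repeats)          # repeat small items
        else:
            start = random.randrange(0, len(text) - SLICE_LEN)
            parts.append(text[start:start + SLICE_LEN])   # new slice each time
    random.shuffle(parts)
    return "\n\n".join(parts)


if __name__ == "__main__":
    docs = {"small.txt": "кратък текст " * 100, "big.txt": "дълъг текст " * 100_000}
    chunk = build_iteration(docs)
    print(len(chunk), "characters in this iteration")
```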
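Training inside short, interruptible free Colab sessions implies saving and restoring the model state between runs; below is a minimal sketch with tf.train.Checkpoint and tf.train.CheckpointManager, assuming a persistent directory on a mounted Google Drive. This is a generic pattern, not necessarily how the original notebook does it.

```python
# Minimal checkpoint/resume sketch for interruptible Colab sessions.
# CKPT_DIR and the save frequency are assumptions, not the repository's setup.
import tensorflow as tf

CKPT_DIR = "/content/drive/MyDrive/bg_gpt_ckpt"   # assumed persistent location


def make_checkpointing(model, optimizer):
    step = tf.Variable(0, dtype=tf.int64)
    ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer, step=step)
    manager = tf.train.CheckpointManager(ckpt, CKPT_DIR, max_to_keep=3)
    if manager.latest_checkpoint:
        ckpt.restore(manager.latest_checkpoint)   # continue an interrupted run
        print("Restored from", manager.latest_checkpoint)
    return ckpt, manager, step

# Inside the training loop, save every N steps so that only a few minutes of
# work are lost when the Colab session is cut off:
#   step.assign_add(1)
#   if int(step) % 500 == 0:
#       manager.save(checkpoint_number=int(step))
```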