riotu-lab/Aranizer-PBE-32k

riotu-lab committed on Aug 25, 2024

Commit a5f1035 · verified · 1 Parent(s): 5bbca3c

update readme.md

Files changed (1)

README.md +43 -1

@@ -5,4 +5,46 @@ language:
 tags:
 - tokenizer
 - PBE
----
+---
+# Aranizer | Arabic Tokenizer
+**Aranizer** is an Arabic PBE-based tokenizer designed for efficient and versatile tokenization.
+## Features
+- **Tokenizer Name**: Aranizer
+- **Type**: PBE tokenizer
+- **Vocabulary Size**: 32,000
+- **Total Number of Tokens**: 1,520,791
+- **Fertility Score**: 1.975
+- **Diacritics**: Supports Arabic diacritization
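The README does not say how the fertility score was measured. A common definition is the average number of tokens produced per whitespace-separated word, and the sketch below assumes that definition. The names `fertility` and `chunk_tokenize` are hypothetical helpers introduced for illustration; `chunk_tokenize` is only a stand-in so the example runs without downloading anything.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


def chunk_tokenize(text: str, size: int = 3) -> List[str]:
    """Hypothetical stand-in tokenizer: cut each word into chunks of at most `size` characters."""
    return [word[i:i + size] for word in text.split() for i in range(0, len(word), size)]


sample = ["اكتب النص العربي"]
print("Fertility:", fertility(sample, chunk_tokenize))
```

To measure a real tokenizer, pass its `tokenize` method as the second argument, for example `fertility(corpus, tokenizer.tokenize)`, where `corpus` is your own list of strings. A value near 1.0 means most words stay whole; higher values mean words are split into more pieces.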
+## How to Use the Aranizer Tokenizer
+The Aranizer tokenizer can be loaded with the Hugging Face `transformers` library. The example below loads the tokenizer and applies it to a short Arabic string:
+```python
+from transformers import AutoTokenizer
+# Load the Aranizer tokenizer
+tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-PBE-32k")
+# Example usage
+text = "اكتب النص العربي"
+tokens = tokenizer.tokenize(text)
+token_ids = tokenizer.convert_tokens_to_ids(tokens)
+print("Tokens:", tokens)
+print("Token IDs:", token_ids)
+```
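Since the feature list mentions diacritization support, one way to check it is to encode a diacritized string, decode the IDs, and compare the number of diacritic marks before and after. The sketch below is one possible check, not a method described in the README. `roundtrip`, `count_diacritics`, and `CharTokenizer` are hypothetical names; `CharTokenizer` is a stand-in that only mimics the `encode`/`decode` calls of a `transformers` tokenizer so the example runs offline.

```python
from typing import List, Tuple

# Arabic diacritic marks (harakat): fathatan U+064B through sukun U+0652.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def count_diacritics(text: str) -> int:
    """Count Arabic diacritic marks in a string."""
    return sum(1 for ch in text if ch in HARAKAT)


def roundtrip(tokenizer, text: str) -> Tuple[List[int], str]:
    """Encode text to IDs without special tokens, then decode the IDs back to text."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return ids, tokenizer.decode(ids, skip_special_tokens=True)


class CharTokenizer:
    """Hypothetical offline stand-in: one ID per character."""

    def encode(self, text, add_special_tokens=False):
        return [ord(ch) for ch in text]

    def decode(self, ids, skip_special_tokens=True):
        return "".join(chr(i) for i in ids)


text = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "kataba" written with three fatha marks
ids, decoded = roundtrip(CharTokenizer(), text)
print("Diacritics before:", count_diacritics(text))
print("Diacritics after:", count_diacritics(decoded))
```

With the tokenizer loaded in the example above, call `roundtrip(tokenizer, text)` instead of using the stand-in. Decoded text from a real tokenizer may differ from the input in whitespace, so comparing diacritic counts is a more forgiving test than exact string equality.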
+## Citation
+```bibtex
+@article{koubaa2024arabiangpt,
+  title={ArabianGPT: Native Arabic GPT-based Large Language Model},
+  author={Koubaa, Anis and Ammar, Adel and Ghouti, Lahouari and Necar, Omer and Sibaee, Serry},
+  year={2024},
+  publisher={Preprints}
+}
+```