riotu-lab/Aranizer-PBE-64k

Arabic Tokenizer

riotu-lab committed on Aug 25, 2024

Commit d594e75 · verified · 1 Parent(s): 09ae685

update readme.md

Files changed (1)

README.md +51 -3

@@ -1,3 +1,51 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language:
+- ar
+tags:
+- Aranizer
+- Arabic Tokenizer
+- PBE
+---
+# Aranizer | Arabic Tokenizer
+**Aranizer** is an Arabic PBE-based tokenizer designed for efficient and versatile tokenization.
+## Features
+- **Tokenizer Name**: Aranizer
+- **Type**: PBE tokenizer
+- **Vocabulary Size**: 64,000
+- **Total Number of Tokens**: 1,358,099
+- **Fertility Score**: 1.764
+- **Diacritization**: Supports Arabic diacritization
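The README does not say how the fertility score was computed. A minimal sketch, assuming the common definition (average number of tokens produced per whitespace-separated word), is shown here. The `fertility` helper and the `toy_tokenize` stand-in are hypothetical names introduced only for illustration; they are not part of Aranizer or `transformers`.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into chunks of up to 3 characters."""
    return [word[i:i + 3] for word in text.split() for i in range(0, len(word), 3)]


sample = ["اكتب النص العربي"]
print(f"Fertility: {fertility(sample, toy_tokenize):.3f}")
```

To measure a real tokenizer, pass its `tokenize` method in place of `toy_tokenize`, for example `fertility(corpus, tokenizer.tokenize)`. The result depends on the evaluation corpus, so it will not necessarily reproduce the 1.764 reported above.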
+## How to Use the Aranizer Tokenizer
+The Aranizer tokenizer can be loaded with the Hugging Face `transformers` library. The example below loads the tokenizer and tokenizes a sample sentence in Python:
+```python
+from transformers import AutoTokenizer
+# Load the Aranizer tokenizer
+tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-PBE-64k")
+# Example usage
+text = "اكتب النص العربي"
+tokens = tokenizer.tokenize(text)
+token_ids = tokenizer.convert_tokens_to_ids(tokens)
+print("Tokens:", tokens)
+print("Token IDs:", token_ids)
+```
+## Citation
+```bibtex
+@article{koubaa2024arabiangpt,
+  title={ArabianGPT: Native Arabic GPT-based Large Language Model},
+  author={Koubaa, Anis and Ammar, Adel and Ghouti, Lahouari and Necar, Omer and Sibaee, Serry},
+  year={2024},
+  publisher={Preprints}
+}