riotu-lab committed · Commit d594e75 · verified · 1 Parent(s): 09ae685

update readme.md

Files changed (1)
  1. README.md +51 -3
README.md CHANGED
@@ -1,3 +1,51 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - ar
+ tags:
+ - Aranizer
+ - Arabic Tokenizer
+ - PBE
+ ---
+
+ # Aranizer | Arabic Tokenizer
+
+ **Aranizer** is an Arabic PBE-based tokenizer designed for efficient and versatile tokenization.
+
+ ## Features
+
+ - **Tokenizer Name**: Aranizer
+ - **Type**: PBE tokenizer
+ - **Vocabulary Size**: 64,000
+ - **Total Number of Tokens**: 1,358,099
+ - **Fertility Score**: 1.764
+ - **Diacritics**: Supports Arabic diacritization
+
+ ## How to Use the Aranizer Tokenizer
+
+ The Aranizer tokenizer can be loaded with the Hugging Face `transformers` library. The example below loads the tokenizer and tokenizes a short Arabic string in a Python project:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Load the Aranizer tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-PBE-64k")
+
+ # Example usage
+ text = "اكتب النص العربي"
+ tokens = tokenizer.tokenize(text)
+ token_ids = tokenizer.convert_tokens_to_ids(tokens)
+
+ print("Tokens:", tokens)
+ print("Token IDs:", token_ids)
+ ```
+
+ ## Citation
+ ```bibtex
+ @article{koubaa2024arabiangpt,
+ title={ArabianGPT: Native Arabic GPT-based Large Language Model},
+ author={Koubaa, Anis and Ammar, Adel and Ghouti, Lahouari and Necar, Omer and Sibaee, Serry},
+ year={2024},
+ publisher={Preprints}
+ }
+ ```
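
Note on the **Fertility Score** listed under Features: the README does not say how 1.764 was computed. Fertility is commonly reported as the average number of tokens produced per whitespace-separated word, and the sketch below assumes that definition. The helper `fertility` and the stand-in `toy_tokenize` are hypothetical names used only so the snippet runs without downloading the model; they are not part of Aranizer or `transformers`.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into 2-character pieces."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return pieces


sample = ["اكتب النص العربي"]
score = fertility(sample, toy_tokenize)
print(f"Fertility: {score:.3f}")
```

Passing `tokenizer.tokenize` from the usage example in place of `toy_tokenize` would apply the same calculation to Aranizer. Under this definition, a score closer to 1 means words are split into fewer pieces.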