Commit b54e8b2 · Parent: e0f651c · Update README.md

File changed: README.md
This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

We trained the model on a merged training dataset consisting of:

- MS MARCO (translated into Vietnamese)
- SQuAD v2 (translated into Vietnamese)
- 80% of the training set from the Legal Text Retrieval Zalo 2021 challenge

We use [phobert-base-v2](https://github.com/VinAIResearch/PhoBERT) as the pre-trained backbone.
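Semantic search with a bi-encoder like this reduces to ranking document embeddings by cosine similarity against a query embedding. A minimal pure-Python sketch of that ranking step, using toy 3-dimensional vectors as stand-ins for the model's 768-dimensional outputs (the vectors and document names are illustrative, not real model outputs):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "document embeddings" (stand-ins for 768-d model outputs).
doc_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.8, 0.6, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query = [1.0, 0.1, 0.0]

# Rank documents by similarity to the query, highest first.
ranking = sorted(doc_vectors, key=lambda d: cosine(query, doc_vectors[d]), reverse=True)
print(ranking)  # -> ['doc_a', 'doc_b', 'doc_c']
```

With the real model, the same ranking would be computed over `model.encode(...)` outputs instead of the toy vectors above.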
Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
sentences = ["Cô ấy là một người vui_tính .", "Cô ấy cười nói suốt cả ngày ."]

model = SentenceTransformer('bkai-foundation-models/vietnamese-bi-encoder')
embeddings = model.encode(sentences)
print(embeddings)
```
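"Word-segmented" means that multi-syllable Vietnamese words are joined with underscores (e.g. `vui_tính` in the example sentences above), which is the input format PhoBERT-based models expect. In practice a real segmenter such as VnCoreNLP's RDRSegmenter should produce this; the tiny dictionary-based function below is only a sketch to illustrate the format, with a hypothetical one-entry compound list:

```python
# Illustrative only: a real Vietnamese word segmenter should be used in
# practice. The single compound listed here is taken from the example
# sentence above, not from any real lexicon.
COMPOUNDS = {("vui", "tính"): "vui_tính"}

def toy_segment(sentence):
    # Greedily merge adjacent syllables that form a known compound word.
    tokens = sentence.split()
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in COMPOUNDS:
            out.append(COMPOUNDS[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

print(toy_segment("Cô ấy là một người vui tính ."))
# -> Cô ấy là một người vui_tính .
```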
## Usage (HuggingFace Widget)

The widget uses a custom pipeline on top of the default one: an additional word segmenter runs before the PhobertTokenizer, so you do not need to segment words yourself before calling the API.

An example can be seen in the Hosted Inference API.

## Usage (HuggingFace Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
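For this model the "right pooling operation" is mean pooling: the Pooling configuration shown in the model architecture has `pooling_mode_mean_tokens: True`, i.e. token embeddings are averaged, counting only positions the attention mask marks as real tokens. A minimal pure-Python sketch of that step, with toy 2-dimensional token vectors standing in for PhoBERT's 768-dimensional outputs:

```python
def mean_pool(token_embeddings, attention_mask):
    # Average token embeddings over positions where the attention mask
    # is 1 (real tokens), ignoring padding positions.
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:
            count += 1
            for j in range(dim):
                sums[j] += vec[j]
    return [s / count for s in sums]

# Toy token embeddings (2-d instead of 768-d) with one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # -> [2.0, 3.0]
```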
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
```
## Citing & Authors

```bibtex
@inproceedings{phobert,
  title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
  author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
  year      = {2020},
  pages     = {1037--1042}
}
```