Commit b54e8b2 · Parent: e0f651c · Update README.md

File changed: README.md
This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

We trained the model on a merged training dataset consisting of:

- MS MARCO (translated into Vietnamese)
- SQuAD v2 (translated into Vietnamese)
- 80% of the training set from the Legal Text Retrieval Zalo 2021 challenge

We use [phobert-base-v2](https://github.com/VinAIResearch/PhoBERT) as the pre-trained backbone.
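Semantic search with a bi-encoder like this reduces to ranking document embeddings by cosine similarity against a query embedding. A minimal pure-Python sketch of that ranking step, using toy 3-dimensional vectors as stand-ins for the model's 768-dimensional outputs (the vectors and document names are illustrative, not real model outputs):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "document embeddings" (stand-ins for 768-d model outputs).
doc_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.8, 0.6, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query = [1.0, 0.1, 0.0]

# Rank documents by similarity to the query, highest first.
ranking = sorted(doc_vectors, key=lambda d: cosine(query, doc_vectors[d]), reverse=True)
print(ranking)  # -> ['doc_a', 'doc_b', 'doc_c']
```

With the real model, the same ranking would be computed over `model.encode(...)` outputs instead of the toy vectors above.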
Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
sentences = ["Cô ấy là một người vui_tính .", "Cô ấy cười nói suốt cả ngày ."]

model = SentenceTransformer('bkai-foundation-models/vietnamese-bi-encoder')
embeddings = model.encode(sentences)
print(embeddings)
```
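"Word-segmented" means that multi-syllable Vietnamese words are joined with underscores (e.g. `vui_tính` in the example sentences above), which is the input format PhoBERT-based models expect. In practice a real segmenter such as VnCoreNLP's RDRSegmenter should produce this; the tiny dictionary-based function below is only a sketch to illustrate the format, with a hypothetical one-entry compound list:

```python
# Illustrative only: a real Vietnamese word segmenter should be used in
# practice. The single compound listed here is taken from the example
# sentence above, not from any real lexicon.
COMPOUNDS = {("vui", "tính"): "vui_tính"}

def toy_segment(sentence):
    # Greedily merge adjacent syllables that form a known compound word.
    tokens = sentence.split()
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in COMPOUNDS:
            out.append(COMPOUNDS[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

print(toy_segment("Cô ấy là một người vui tính ."))
# -> Cô ấy là một người vui_tính .
```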
## Usage (HuggingFace Widget)

The widget uses a custom pipeline on top of the default one: an additional word segmenter runs before the PhobertTokenizer, so you do not need to segment words yourself before calling the API.

An example can be seen in the Hosted Inference API.

## Usage (HuggingFace Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
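For this model the "right pooling operation" is mean pooling: the Pooling configuration shown in the model architecture has `pooling_mode_mean_tokens: True`, i.e. token embeddings are averaged, counting only positions the attention mask marks as real tokens. A minimal pure-Python sketch of that step, with toy 2-dimensional token vectors standing in for PhoBERT's 768-dimensional outputs:

```python
def mean_pool(token_embeddings, attention_mask):
    # Average token embeddings over positions where the attention mask
    # is 1 (real tokens), ignoring padding positions.
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:
            count += 1
            for j in range(dim):
                sums[j] += vec[j]
    return [s / count for s in sums]

# Toy token embeddings (2-d instead of 768-d) with one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # -> [2.0, 3.0]
```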
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
```
## Citing & Authors

```bibtex
@inproceedings{phobert,
  title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
  author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
  year      = {2020},
  pages     = {1037--1042}
}
```