---
base_model: FacebookAI/xlm-roberta-base
datasets:
- StyleDistance/mstyledistance_training_triplets
library_name: sentence-transformers
pipeline_tag: feature-extraction
license: mit
tags:
- datadreamer
- datadreamer-0.35.0
- synthetic
- sentence-transformers
- feature-extraction
- sentence-similarity
widget:
- example_title: Example 1
  source_sentence: 彼は技術的な複雑さと格闘し、彼の作品は驚くべき視覚的緊張を生み出した。
  sentences:
  - Serviste mariscos frescos en el condado de Middlesex y áreas circundantes.
  - Él sirvió mariscos frescos en el condado de Middlesex y áreas circundantes.
- example_title: Example 2
  source_sentence: Bien sûr, ils termineront la construction du pont en une semaine.
  sentences:
  - Oh, you mean when I single-handedly tackled that bespoke headboard project?
  - Remember when I completed that bespoke headboard project on my own?
- example_title: Example 3
  source_sentence: 我将使用有限的色调和小尺寸进行像素艺术的简化和风格化设计。
  sentences:
  - Я ценю ТТ-пистолет за его огневую мощь; его проникающая способность впечатляет
    меня.
  - 你将使用有限的色调和小尺寸进行像素艺术的简化和风格化设计。
---

# Model Card

This repository contains the model introduced in [mStyleDistance: Multilingual Style Embeddings and their Evaluation](https://hf.co/papers/2502.15168).

mStyleDistance is a **multilingual style embedding model** that aims to embed texts with similar writing styles close together and texts with different styles far apart, regardless of content and language. You may find this model useful for stylistic analysis of multilingual text, clustering, authorship identification and verification tasks, and automatic style transfer evaluation.

This model is a multilingual version of the English-only [StyleDistance](https://huggingface.co/StyleDistance/styledistance) model.

## Training Data and Variants of StyleDistance

mStyleDistance was contrastively trained on [mSynthSTEL](https://huggingface.co/datasets/StyleDistance/msynthstel), a synthetically generated dataset of positive and negative examples of ~40 style features used in text in 9 non-English languages. Because the synthetic examples vary style while holding content fixed, mStyleDistance achieves stronger content-independence than other style embedding models currently available and can operate on multilingual text.
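The exact training objective is not stated in this card. As a rough illustration of what contrastive training on (anchor, positive, negative) triplets can look like, the sketch below computes a triplet margin loss over cosine distance. The function names, the `margin` value, and the toy 2-D vectors are all assumptions made for illustration; they are not taken from the mStyleDistance training code.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hypothetical triplet margin loss (margin chosen arbitrarily).

    The loss is zero once the anchor is closer to the positive (same style)
    than to the negative (different style) by at least `margin`.
    """
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)

# Toy placeholder embeddings, not real model outputs.
anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # similar direction: same style
negative = np.array([0.0, 1.0])   # orthogonal: different style

print(triplet_loss(anchor, positive, negative))  # well separated triplet
print(triplet_loss(anchor, negative, positive))  # roles swapped: penalized
```

A loss of this shape only rewards the relative ordering of distances, which is one way a model could learn to ignore content when the positive and negative share content but differ in style.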

## Example Usage

```python3
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('StyleDistance/mstyledistance')  # Load model

# Query text written in all caps.
query = model.encode("ÉL TIENE PROBLEMAS PARA LOGRAR LA TEMPERATURA ADECUADA PARA COCINAR LA GALLINA CORNISH.")
# First candidate: same style (all caps), different content.
# Second candidate: same content, different style (standard casing).
others = model.encode(["TOCARÁS LA GUITARRA CON TU AMIGO; SERÁ UNA EXCELENTE OPORTUNIDAD PARA MEJORAR TUS HABILIDADES MUSICALES.", "Él tiene problemas para lograr la temperatura adecuada para cocinar la gallina Cornish."])
# Cosine similarity of the query against each candidate.
print(cos_sim(query, others))
```
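For the clustering and authorship verification use cases mentioned earlier, a common next step is to rank candidate texts by stylistic similarity to a query. The following is a minimal sketch using NumPy only. `rank_by_style` is a hypothetical helper, not part of `sentence-transformers`, and the vectors are toy placeholders standing in for the arrays that `model.encode(...)` would return.

```python
import numpy as np

def rank_by_style(query_emb, candidate_embs):
    """Rank candidates by cosine similarity to the query, most similar first.

    Returns a list of (candidate_index, similarity) pairs.
    """
    q = np.asarray(query_emb, dtype=float)
    c = np.asarray(candidate_embs, dtype=float)
    # L2-normalize so that the dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)  # descending similarity
    return [(int(i), float(scores[i])) for i in order]

# Toy placeholder embeddings (assumption: real ones come from model.encode).
query = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],  # points in a different direction from the query
    [0.9, 0.1, 0.0],  # points in nearly the same direction as the query
]

ranking = rank_by_style(query, candidates)
print(ranking[0][0])  # index of the most similar candidate
```

With real embeddings, the top-ranked index identifies the candidate whose style is closest to the query under this model. Any decision threshold for verification would need to be tuned on held-out data.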

---
## Citation

```latex
@misc{qiu2025mstyledistancemultilingualstyleembeddings,
      title={mStyleDistance: Multilingual Style Embeddings and their Evaluation}, 
      author={Justin Qiu and Jiacheng Zhu and Ajay Patel and Marianna Apidianaki and Chris Callison-Burch},
      year={2025},
      eprint={2502.15168},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.15168}, 
}
```

---
## Trained with DataDreamer

This model was trained on a synthetic dataset using [DataDreamer 🤖💤](https://datadreamer.dev). The synthetic dataset card and model card can be found [here](datadreamer.json). The training arguments can be found [here](training_args.json).

---
#### Funding Acknowledgements

<small> This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. </small>