metadata
license: apache-2.0
datasets:
- oeg/CelebA_Sent2Vect_Sp
language:
- es
tags:
- CelebA
- Spanish
- celebFaces Attributes
Sent2vec trained with data from the descriptive text corpus of the CelebA dataset
Overview
- Language: Spanish
- Data: CelebA_Sent2vec_Sp
- Architecture: Sent2vec
Description
Sent2vec can be used directly for English texts. Because this work targets Spanish text, the model first had to be trained on the generated corpus (in this repository) using the following process:
- Initial preprocessing of the Spanish corpus. A new file was created that keeps each entry of the original corpus and removes the other components, such as the name of the image each entry describes and symbols. A total of 192,209 sentences are available for training.
- Apply a second preprocessing step that removes accents. Stopwords and connectors were retained as part of the sentence structure during training.
- Configure the libraries (e.g., Sent2vec and FastText) and the parameters. The parameters were set empirically: a feature vector dimension of 4,800, 5,000 epochs, 200 threads, an n-gram size of 2, and a learning rate of 0.05.
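The accent-removal step and the parameter list above can be sketched in Python as follows. This is a minimal illustration, not the authors' script: the helper names `strip_accents` and `build_train_command`, the file names, and the mapping of the listed parameters onto the Sent2vec command-line flags (`-dim`, `-epoch`, `-thread`, `-wordNgrams`, `-lr`) are assumptions. This way of removing accents also turns `ñ` into `n`; whether the original preprocessing did so is not stated.

```python
import unicodedata
from typing import List


def strip_accents(text: str) -> str:
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # Stopwords and connectors are left untouched; only combining marks are dropped.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_train_command(corpus_path: str, output_prefix: str) -> List[str]:
    """Assemble a Sent2vec training command using the parameters listed above.

    The binary path "./fasttext" and the flag mapping are assumptions based on
    the Sent2vec command-line interface.
    """
    return [
        "./fasttext", "sent2vec",
        "-input", corpus_path,
        "-output", output_prefix,
        "-dim", "4800",       # feature vector dimension
        "-epoch", "5000",     # training epochs
        "-thread", "200",     # worker threads
        "-wordNgrams", "2",   # n-gram size
        "-lr", "0.05",        # learning rate
    ]
```

Once the Sent2vec binary has been built, the returned list could be passed to `subprocess.run` to start training on the preprocessed corpus file.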
How to use
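A minimal usage sketch, assuming the `sent2vec` Python bindings from the Sent2vec project are installed and the trained model file has been downloaded locally. The file name `model.bin` and the helper names `normalize_sentence` and `embed_sentences` are hypothetical. Input sentences are stripped of accents to mirror the preprocessing applied to the training corpus.

```python
import unicodedata
from typing import List


def normalize_sentence(sentence: str) -> str:
    """Drop diacritics so the input matches the accent-free training corpus."""
    decomposed = unicodedata.normalize("NFD", sentence)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def embed_sentences(model_path: str, sentences: List[str]):
    """Load a Sent2vec model and return one embedding per input sentence."""
    # Imported lazily so normalize_sentence works without the bindings installed.
    import sent2vec

    model = sent2vec.Sent2vecModel()
    model.load_model(model_path)  # e.g. "model.bin" (hypothetical file name)
    return model.embed_sentences([normalize_sentence(s) for s in sentences])
```

Calling `embed_sentences("model.bin", ["La mujer tiene el pelo castaño."])` should return one vector per sentence. With the configuration described in this card, each vector would be expected to have 4,800 components.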
Licensing information
This model is available under the Apache License 2.0.
Citation information
Citing: If you use the Sent2vec+CelebA model in your work, please cite the ????:
Authors
Universidad Nacional de Ingeniería, Ontology Engineering Group, Universidad Politécnica de Madrid.
Contributors
See the full list of contributors here.