metadata

license: apache-2.0
datasets:
  - oeg/CelebA_Sent2Vect_Sp
language:
  - es
tags:
  - CelebA
  - Spanish
  - celebFaces Attributes

Sent2vec trained with data from the descriptive text corpus of the CelebA dataset

Overview

Language: Spanish
Data: CelebA_Sent2Vect_Sp
Architecture: Sent2vec

Description

Sent2vec can be used directly on English text. Because this work deals with Spanish text, however, the model first had to be trained on the generated corpus (in this repository) using the following process:

Initial preprocessing of the Spanish corpus. For this purpose, a new file was created that keeps each entry of the original corpus and removes the other components, such as the name of the image it describes and symbols. A total of 192,209 sentences are available for training.
Apply a second preprocessing step that removes accents. Stopwords and connectors were retained as part of the sentence structure during training.
Configure the libraries, e.g., Sent2vec and FastText, and the parameters. The parameters were set empirically: a feature vector dimension of 4,800, 5,000 epochs, 200 threads, an n-gram size of 2, and a learning rate of 0.05.
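
The steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual script: the image-name pattern, the definition of "symbols", the `./fasttext` binary path, and the mapping of the listed parameters onto the Sent2vec command-line flags (`-dim`, `-epoch`, `-thread`, `-wordNgrams`, `-lr`) are all assumptions.

```python
import re
import unicodedata

# Assumption: image names look like "000001.jpg"; the real corpus layout may differ.
IMAGE_NAME = re.compile(r"\S+\.(?:jpg|jpeg|png)\b", re.IGNORECASE)
# Assumption: a "symbol" is any character that is not a letter, digit,
# underscore, or whitespace.
SYMBOLS = re.compile(r"[^\w\s]")


def strip_accents(text: str) -> str:
    """Remove combining marks after NFD decomposition.

    Note that this also turns "ñ" into "n", since the tilde is a combining mark.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def clean_entry(line: str) -> str:
    """Drop the image name and symbols, strip accents, collapse whitespace.

    Stopwords and connectors are deliberately left in place.
    """
    line = IMAGE_NAME.sub(" ", line)
    line = SYMBOLS.sub(" ", line)
    line = strip_accents(line)
    return " ".join(line.split())


def training_command(corpus_path: str, output_path: str) -> list[str]:
    """Build a Sent2vec training command using the parameters listed above."""
    return [
        "./fasttext", "sent2vec",      # hypothetical path to the compiled binary
        "-input", corpus_path,
        "-output", output_path,
        "-dim", "4800",
        "-epoch", "5000",
        "-thread", "200",
        "-wordNgrams", "2",
        "-lr", "0.05",
    ]


if __name__ == "__main__":
    print(clean_entry("000001.jpg La mujer tiene el pelo castaño, y está sonriendo."))
    # File names below are placeholders.
    print(" ".join(training_command("corpus_clean.txt", "celeba_sent2vec_sp")))
```

The command list could be passed to `subprocess.run` once the cleaned corpus has been written to disk, one sentence per line.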

How to use
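
A possible way to load the model and compare two sentences is sketched below. It assumes the Python bindings from the Sent2vec project are installed and that the trained model has been downloaded locally. The file name `model.bin` is a placeholder, not the actual name of the file in this repository. Input sentences are stripped of accents first so that they match the preprocessing used for training.

```python
import math
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks so the input matches the training preprocessing."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def embed(model_path: str, sentences: list[str]):
    """Return one embedding per sentence as rows of a 2D array."""
    # Imported lazily: third-party package, assumed to be installed.
    import sent2vec

    model = sent2vec.Sent2vecModel()
    model.load_model(model_path)
    return model.embed_sentences([strip_accents(s) for s in sentences])


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # "model.bin" is a hypothetical local path to the downloaded model.
    vectors = embed("model.bin", [
        "La mujer tiene el pelo castaño",
        "El hombre está sonriendo",
    ])
    print(cosine_similarity(vectors[0], vectors[1]))
```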

Licensing information

This model is available under the Apache License 2.0.

Citation information

Citing: If you use the Sent2vec+CelebA model in your work, please cite the ????:

Authors

Universidad Nacional de Ingeniería, Ontology Engineering Group, Universidad Politécnica de Madrid.

Contributors

See the full list of contributors here.