oeg
/

esgg committed on
Commit
e9eb40d
·
1 Parent(s): 09f5684

Update README.md

Files changed (1)
  1. README.md +48 -0
README.md CHANGED
@@ -1,3 +1,51 @@
1
  ---
2
  license: cc-by-4.0
3
  ---
4
+ # Software Benchmark SciBERT model
5
+ This model is a version of [SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased) fine-tuned on a dataset built from the SoMeSci and Softcite corpora.
6
+
7
+ The objective of this model is to extract software mentions from scientific texts in the biomedical domain.
8
+
9
+ The training code is available on [GitHub](https://github.com/oeg-upm/software_mentions_benchmark).
10
+
11
+ ## Corpus
12
+
13
+ The corpus has been built from two existing corpora of software mentions:
14
+ * SoMeSci [1]. We used the corpus uploaded to [GitHub](https://github.com/dave-s477/SoMeSci/tree/9f17a43f342be026f97f03749457d4abb1b01dbf/PLoS_sentences), specifically the version split into sentences.
15
+ * Softcite [2]. This project has published another corpus of software mentions, also available on [GitHub](https://github.com/howisonlab/softcite-dataset/tree/master/data/corpus). Note that we only use the annotations from the biomedical domain.
16
+
17
+ To build this corpus, we removed the annotations of other entities, such as version and URL, as well as those describing the relation of the entity to the text.
18
+
19
+ To reconcile the two corpora, we mapped the labels of both corpora. We also made some annotation decisions: for example, for "Microsoft Excel" we annotate only "Excel" as the software mention, not the whole phrase.
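The label mapping and the "Microsoft Excel" decision can be illustrated with a minimal sketch. It is not the authors' code: the label names in `LABEL_MAP`, the `VENDOR_PREFIXES` tuple, the `harmonize` function, and the BIO tag scheme are all assumptions made for illustration, and the real tag sets of SoMeSci and Softcite may differ.

```python
# Hypothetical label names; the real SoMeSci and Softcite tag sets may differ.
SOFTWARE = "software"

# Map each corpus-specific label to the shared label, or to None to drop it.
LABEL_MAP = {
    "Application": SOFTWARE,  # assumed SoMeSci-style label
    "software": SOFTWARE,     # assumed Softcite-style label
    "Version": None,
    "version": None,
    "URL": None,
    "url": None,
}

# Vendor tokens trimmed from the start of a mention ("Microsoft Excel" -> "Excel").
VENDOR_PREFIXES = ("Microsoft",)


def harmonize(tokens, annotations):
    """Return one BIO tag per token, keeping only software mentions.

    annotations: list of (start, end, label) token spans, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in annotations:
        if LABEL_MAP.get(label) != SOFTWARE:
            continue  # drop versions, URLs and unknown labels
        # Skip leading vendor tokens, but never empty the span entirely.
        while end - start > 1 and tokens[start] in VENDOR_PREFIXES:
            start += 1
        tags[start] = "B-" + SOFTWARE
        for i in range(start + 1, end):
            tags[i] = "I-" + SOFTWARE
    return tags


tokens = ["We", "used", "Microsoft", "Excel", "2010", "."]
annotations = [(2, 4, "Application"), (4, 5, "Version")]
print(harmonize(tokens, annotations))
```

In this example only the token "Excel" receives a software tag: the vendor name is trimmed and the version annotation is discarded.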
20
+
21
+ ## Training
22
+
23
+ The corpus was split 70/30 into training and test sets.
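A 70/30 split of this kind could be produced as in the sketch below. The README does not say whether the split was random, stratified, or done per document, so the seeded random shuffle, the `split_corpus` helper, and the seed value are assumptions.

```python
import random


def split_corpus(sentences, train_fraction=0.7, seed=42):
    """Shuffle a copy of the sentences and cut it into train and test parts."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


sentences = [f"sentence {i}" for i in range(10)]
train, test = split_corpus(sentences)
print(len(train), len(test))
```

Fixing the seed keeps the same sentences in the test set across runs, which matters when comparing models on the benchmark.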
24
+
25
+ The training code is available on [GitHub](https://github.com/oeg-upm/software_mentions_benchmark).
26
+
27
+ The results are:
28
+ * Precision: 0.823
29
+ * Recall: 0.814
30
+ * F1-score: 0.819
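For readers unfamiliar with these metrics, the sketch below shows one common way to compute them for mention extraction: exact-match comparison of predicted and gold spans. Whether the reported figures are entity-level, token-level, or averaged is not stated in the README, so the span representation and the `precision_recall_f1` helper are assumptions.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match scores over sets of (sentence_id, start, end) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: two of three predicted spans match a gold span exactly.
gold = {(0, 3, 4), (1, 0, 1), (2, 5, 7)}
predicted = {(0, 3, 4), (1, 0, 2), (2, 5, 7)}
p, r, f = precision_recall_f1(gold, predicted)
print(p, r, f)

# Harmonic mean of the precision and recall reported above.
reported_p, reported_r = 0.823, 0.814
harmonic = 2 * reported_p * reported_r / (reported_p + reported_r)
print(harmonic)
```

The harmonic mean of the rounded precision and recall is about 0.8185, so the reported F1-score of 0.819 is consistent with them up to rounding of the inputs.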
31
+
32
+ ## Acknowledgements
33
+
34
+ This work was made possible by the efforts of other projects:
35
+ * Softcite
36
+ * SoMeSci
37
+ * [SCIBERT](https://huggingface.co/allenai/scibert_scivocab_uncased)
38
+
39
+ ## Authors
40
+
41
+ * Esteban González Guardia
42
+ * Daniel Garijo Verdejo
43
+
44
+ ## Contributors
45
+
46
+ <kbd><img src="https://raw.githubusercontent.com/oeg-upm/TINTO/main/assets/logo-oeg.png" alt="Ontology Engineering Group" width="100"></kbd>
47
+ <kbd><img src="https://raw.githubusercontent.com/oeg-upm/TINTO/main/assets/logo-upm.png" alt="Universidad Politécnica de Madrid" width="100"></kbd>
48
+
49
+ ## References
50
+ [1] Schindler, D., Bensmann, F., Dietze, S., & Krüger, F. (2021, October). SoMeSci - A 5 star open data gold standard knowledge graph of software mentions in scientific articles. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (pp. 4574-4583).
51
+ [2] Du, C., Cohoon, J., Lopez, P., & Howison, J. (2021). Softcite dataset: A dataset of software mentions in biomedical and economic research publications. Journal of the Association for Information Science and Technology, 72(7), 870-884.