Monolingual and Cross-lingual Word Similarity Datasets
We release two monolingual word similarity datasets for Spanish
and Farsi (Persian) obtained from the standard RG-65 dataset (Rubenstein and Goodenough, 1965).
The guidelines for the creation of these datasets
(in Spanish and Farsi) are also provided.
Additionally, we provide fifteen cross-lingual datasets covering English (EN), French (FR), German (DE), Spanish (ES), Portuguese (PT) and Farsi (FA).
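The count of fifteen is consistent with one dataset per unordered pair of the six languages listed (6 x 5 / 2 = 15). A minimal Python sketch of that enumeration follows. The pair labels such as EN-FR are illustrative only and are an assumption, not the actual file names in the release.

```python
from itertools import combinations

# Language codes as listed in the description above.
LANGUAGES = ["EN", "FR", "DE", "ES", "PT", "FA"]


def language_pairs(languages):
    """Return every unordered pair of distinct language codes."""
    return list(combinations(languages, 2))


pairs = language_pairs(LANGUAGES)
print(len(pairs))  # number of unordered pairs from six languages
for first, second in pairs:
    # Hypothetical label format; the released files may be named differently.
    print(f"{first}-{second}")
```

Each pair appears once, so EN-FR and FR-EN are treated as the same dataset in this sketch.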
@inproceedings{camacho2015framework,
title={A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets},
author={Camacho-Collados, Jos{\'e} and Pilehvar, Mohammad Taher and Navigli, Roberto},
booktitle={Proceedings of ACL (2)},
pages={1--7},
year={2015}
}
Contact
Should you have any enquiries about any of the resources, please contact José Camacho Collados (collados [at] di.uniroma1 [dot] it)
or Mohammad Taher Pilehvar (pilehvar [at] di.uniroma1 [dot] it).
Other references
Original English RG-65 word similarity dataset: Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627-633.
French RG-65 dataset: Colette Joubarne and Diana Inkpen. 2011. Comparison of semantic similarity for different languages
using the Google n-gram corpus and second-order co-occurrence measures. In Advances in Artificial Intelligence, pages 216-221. Perth, Australia.
German RG-65 dataset: Iryna Gurevych. 2005. Using the structure of a conceptual network in computing semantic relatedness.
In Proceedings of IJCNLP, pages 767-778. Jeju Island, Korea.
Portuguese RG-65 dataset: Roger Granada, Cassia Trojahn, and Renata Vieira. 2014. Comparing semantic relatedness between
word pairs in Portuguese using Wikipedia. In Computational Processing of the Portuguese Language, pages 170-175. São Carlos/SP, Brazil.
Construction of a French-English dataset: Alistair Kennedy and Graeme Hirst. 2012. Measuring semantic relatedness across languages. In Proceedings of xLiTe:
Cross-Lingual Technologies Workshop at the Neural Information Processing Systems Conference, pages 1-6, Lake Tahoe, USA.
Baselines for English, French, German, and three of the cross-lingual datasets: José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2015.
A Unified Multilingual Semantic Representation of Concepts. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015),
Beijing, China, 26-31 July 2015.
Multilingual WS353: Ira Leviant and Roi Reichart. 2015. Judgment Language Matters: Multilingual Vector Space Models for Judgment
Language Aware Lexical Semantics. Preprint published on arXiv. arXiv:1508.00106
Last update: 25 Nov. 2016 by José Camacho Collados