Monolingual and Cross-lingual Word Similarity Datasets

We release two monolingual word similarity datasets for Spanish and Farsi (Persian) obtained from the standard RG-65 dataset (Rubenstein and Goodenough, 1965). The guidelines for the creation of these datasets (in Spanish and Farsi) are also provided.

Additionally, we provide fifteen cross-lingual datasets covering English (EN), French (FR), German (DE), Spanish (ES), Portuguese (PT) and Farsi (FA).
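The count of fifteen matches the number of unordered pairs that can be formed from six languages. The short Python sketch below enumerates those pairs, assuming that each cross-lingual dataset corresponds to one pair. The pair labels it prints (such as EN-FR) are illustrative only and are not necessarily the file names used in the package.

```python
from itertools import combinations

# Language codes as listed in the description above.
LANGUAGES = ["EN", "FR", "DE", "ES", "PT", "FA"]

# Assumption: one cross-lingual dataset per unordered pair of distinct languages.
language_pairs = list(combinations(LANGUAGES, 2))

for first, second in language_pairs:
    # Illustrative label only; the actual file names may differ.
    print(f"{first}-{second}")

print(len(language_pairs))  # 6 choose 2 = 15
```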

Shared task: see the SemEval-2017 task on Multilingual and Cross-lingual Semantic Word Similarity, which provides data in English, German, Italian, Spanish and Farsi.

Datasets and Resources

Download the whole package including a README file.

Alternatively, download a specific dataset:


We also release three cross-lingual datasets based on the Multilingual WordSim353 released by Leviant and Reichart (2015) for English-German, English-Italian and German-Italian.
To obtain the original monolingual MWS353 data, please visit http://technion.ac.il/~ira.leviant/MultilingualVSMdata.html

Reference paper

When using these resources, please refer to the following paper:

José Camacho-Collados, Mohammad Taher Pilehvar and Roberto Navigli. A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets.
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), Short Papers, Beijing, China, July 27-29, 2015.

Contact

Should you have any enquiries about any of the resources, please contact José Camacho Collados (collados [at] di.uniroma1 [dot] it) or Mohammad Taher Pilehvar (pilehvar [at] di.uniroma1 [dot] it).

Other references

Original English RG-65 word similarity dataset: Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627-633.

French RG-65 dataset: Colette Joubarne and Diana Inkpen. 2011. Comparison of semantic similarity for different languages using the Google n-gram corpus and second-order co-occurrence measures. In Advances in Artificial Intelligence, pages 216-221. Perth, Australia.

German RG-65 dataset: Iryna Gurevych. 2005. Using the structure of a conceptual network in computing semantic relatedness. In Proceedings of IJCNLP, pages 767-778. Jeju Island, Korea.

Portuguese RG-65 dataset: Roger Granada, Cassia Trojahn, and Renata Vieira. 2014. Comparing semantic relatedness between word pairs in Portuguese using Wikipedia. In Computational Processing of the Portuguese Language, pages 170-175. Sao Carlos/SP, Brazil.

Construction of a French-English dataset: Alistair Kennedy and Graeme Hirst. 2012. Measuring semantic relatedness across languages. In Proceedings of xLiTe: Cross-Lingual Technologies Workshop at the Neural Information Processing Systems Conference, pages 1-6, Lake Tahoe, USA.

Baselines for English, French, German, and three of the cross-lingual datasets: José Camacho-Collados, Mohammad Taher Pilehvar and Roberto Navigli. 2015. A Unified Multilingual Semantic Representation of Concepts. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015), Beijing, China, 26-31 July 2015.

Multilingual WS353: Ira Leviant and Roi Reichart. 2015. Judgment Language Matters: Multilingual Vector Space Models for Judgment Language Aware Lexical Semantics. Preprint published on arXiv. arXiv:1508.00106


Last update: 25 Nov. 2015 by José Camacho Collados