Monolingual and Cross-lingual Word Similarity Datasets

We release two monolingual word similarity datasets for Spanish and Farsi (Persian) obtained from the standard RG-65 dataset (Rubenstein and Goodenough, 1965). The guidelines for the creation of these datasets (in Spanish and Farsi) are also provided.

Additionally, we provide fifteen cross-lingual datasets, one for each pair of the following six languages: English (EN), French (FR), German (DE), Spanish (ES), Portuguese (PT) and Farsi (FA).

Shared task: see the SemEval-2017 task on Multilingual and Cross-lingual Semantic Word Similarity, which provides data in English, German, Italian, Spanish and Farsi.

Datasets and Resources

Download the whole package including a README file.

Or, alternatively, download a specific dataset:

The RG-65 dataset in Spanish and Farsi.

Cross-lingual datasets:

	FR	DE	ES	PT	FA
EN	EN-FR	EN-DE	EN-ES	EN-PT	EN-FA
FR	-	FR-DE	FR-ES	FR-PT	FR-FA
DE	-	-	DE-ES	DE-PT	DE-FA
ES	-	-	-	ES-PT	ES-FA
PT	-	-	-	-	PT-FA
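
The fifteen pairs in the table are the unordered combinations of the six language codes. As a minimal sketch, they can be enumerated as follows; the names LANGUAGES, PAIRS and cross_lingual_pairs are illustrative only and are not part of the released package, and the "XX-YY" labels are the table's cell names, not necessarily the file names inside the download.

```python
from itertools import combinations

# Language codes in the order used by the table above.
LANGUAGES = ["EN", "FR", "DE", "ES", "PT", "FA"]


def cross_lingual_pairs(languages):
    """Return every unordered language pair as an 'XX-YY' label.

    The labels mirror the cells of the table; they are an illustration,
    not a guarantee about how the downloadable files are named.
    """
    return [f"{first}-{second}" for first, second in combinations(languages, 2)]


PAIRS = cross_lingual_pairs(LANGUAGES)

if __name__ == "__main__":
    # Six languages give 6 * 5 / 2 unordered pairs.
    print(len(PAIRS))
    print(PAIRS)
```

The pairs come out row by row in the same order as the table, starting with EN-FR and ending with PT-FA.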

New! We also release cross-lingual datasets based on WordSim353 and SimLex-999 for English-German, English-Italian and German-Italian. Download:

The original monolingual datasets, released by Leviant and Reichart (2015), can be downloaded at http://leviants.com/ira.leviant/MultilingualVSMdata.html

Reference paper

When using these resources, please refer to the following paper:

José Camacho-Collados, Mohammad Taher Pilehvar and Roberto Navigli.
A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets.
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), Short Papers, Beijing, China, July 27-29, 2015.

@inproceedings{camacho2015framework,
  title={A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets},
  author={Camacho-Collados, Jos{\'e} and Pilehvar, Mohammad Taher and Navigli, Roberto},
  booktitle={Proceedings of ACL (2)},
  pages={1--7},
  year={2015}
}

Contact

Should you have any enquiries about any of the resources, please contact José Camacho Collados (collados [at] di.uniroma1 [dot] it) or Mohammad Taher Pilehvar (pilehvar [at] di.uniroma1 [dot] it).

Other references

Original English RG-65 word similarity dataset: Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627-633.

French RG-65 dataset: Colette Joubarne and Diana Inkpen. 2011. Comparison of semantic similarity for different languages using the Google n-gram corpus and second-order co-occurrence measures. In Advances in Artificial Intelligence, pages 216-221. Perth, Australia.

German RG-65 dataset: Iryna Gurevych. 2005. Using the structure of a conceptual network in computing semantic relatedness. In Proceedings of IJCNLP, pages 767-778. Jeju Island, Korea.

Portuguese RG-65 dataset: Roger Granada, Cassia Trojahn, and Renata Vieira. 2014. Comparing semantic relatedness between word pairs in Portuguese using Wikipedia. In Computational Processing of the Portuguese Language, pages 170-175. Sao Carlos/SP, Brazil.

Construction of a French-English dataset: Alistair Kennedy and Graeme Hirst. 2012. Measuring semantic relatedness across languages. In Proceedings of xLiTe: Cross-Lingual Technologies Workshop at the Neural Information Processing Systems Conference, pages 1-6, Lake Tahoe, USA.

Baselines for English, French, German, and three of the cross-lingual datasets: José Camacho-Collados, Mohammad Taher Pilehvar and Roberto Navigli. 2015. A Unified Multilingual Semantic Representation of Concepts. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015), Beijing, China, 26-31 July 2015.

Multilingual WS353: Ira Leviant and Roi Reichart. 2015. Judgment Language Matters: Multilingual Vector Space Models for Judgment Language Aware Lexical Semantics. Preprint published on arXiv. arXiv:1508.00106

Last update: 25 Nov. 2016 by José Camacho Collados