=================================================================================================================================
A FRAMEWORK FOR THE CONSTRUCTION OF MONOLINGUAL AND CROSS-LINGUAL WORD SIMILARITY DATASETS
José Camacho Collados, Mohammad Taher Pilehvar and Roberto Navigli
=================================================================================================================================

This package contains two directories (Monolingual and Cross-lingual):

* Monolingual
  - Datasets: It contains two monolingual word similarity datasets.
      - "rg65_spanish.txt": Spanish version of RG-65
      - "rg65_farsi.txt": Farsi (Persian) version of RG-65
  - Guidelines: It contains the guidelines used for the construction of the monolingual datasets.
      - "Similarity_Guidelines_ES.pdf": Annotation guidelines used for creating the Spanish word similarity dataset.
      - "Similarity_Guidelines_FA.pdf": Annotation guidelines used for creating the Farsi word similarity dataset.

* Cross-lingual
  - Datasets: It contains fifteen cross-lingual word similarity datasets (15 different language pairs) based on RG-65, covering English (EN), French (FR), German (DE), Spanish (ES), Portuguese (PT), and Farsi (FA).
  - "cross-lingual_dataset_creation.py": The Python script for the automatic creation of cross-lingual datasets. See the instructions below on how to use it.

=================================================================================================================================
FORMAT OF THE DATASETS
=================================================================================================================================

Each line in these seventeen datasets (the Spanish and Farsi monolingual datasets and the fifteen cross-lingual datasets) is formatted as follows:

word1	word2	similarity_score

=================================================================================================================================
INSTRUCTIONS FOR AUTOMATICALLY CREATING CROSS-LINGUAL DATASETS
=================================================================================================================================

Input: Two monolingual datasets previously aligned pair-wise (line by line), following the format indicated above.
Output: The cross-lingual dataset in the same format, saved in the same directory.

Instructions to run the Python script "cross-lingual_dataset_creation.py":

The code takes the following parameters: path, file_1, file_2, and size_sim_scale

  path           : Path of the directory containing the monolingual datasets. By default, the cross-lingual dataset is created in this same directory.
  file_1         : File name of the first dataset.
  file_2         : File name of the second dataset.
  size_sim_scale : Size of the similarity scale (in RG-65, for example, the size of the similarity scale is 4).

Run it in the terminal by typing the following command:

$ python cross-lingual_dataset_creation.py path file_1 file_2 size_sim_scale

It will create the new cross-lingual dataset in the same directory (path).

Example of usage:

$ python cross-lingual_dataset_creation.py /home/Monolingual/Datasets/ rg65_spanish.txt rg65_farsi.txt 4

This creates a cross-lingual dataset in "/home/Monolingual/Datasets/" with the name "cross_rg65_spanish_rg65_farsi.txt".

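Before running the script, it may help to check that the two input files respect the dataset format and are aligned line by line. The following is a minimal sketch of such a check and is not part of the package. The function names parse_dataset and check_alignment are hypothetical, and the sketch assumes that the three fields are separated by tabs and that scores lie between 0 and size_sim_scale. The sample lines are made up for illustration and are not taken from the datasets.

```python
def parse_dataset(lines):
    """Parse dataset lines into a list of (word1, word2, score) tuples.

    Assumption: the three fields on each line are tab-separated.
    """
    pairs = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore empty lines (e.g. a trailing newline at end of file)
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                "line %d: expected 3 fields, got %d" % (line_no, len(fields))
            )
        word1, word2, score = fields
        pairs.append((word1, word2, float(score)))
    return pairs


def check_alignment(dataset_1, dataset_2, size_sim_scale):
    """Return a list of human-readable problems (empty if none were found).

    Assumption: valid scores lie in the range [0, size_sim_scale].
    """
    problems = []
    # Line-by-line alignment requires the same number of pairs in both files.
    if len(dataset_1) != len(dataset_2):
        problems.append(
            "different number of pairs: %d vs %d" % (len(dataset_1), len(dataset_2))
        )
    for name, dataset in (("file_1", dataset_1), ("file_2", dataset_2)):
        for index, (_, _, score) in enumerate(dataset, start=1):
            if not 0.0 <= score <= size_sim_scale:
                problems.append(
                    "%s, pair %d: score %s is outside the scale" % (name, index, score)
                )
    return problems


# Made-up sample lines, for illustration only.
sample_1 = ["word_a\tword_b\t3.5", "word_c\tword_d\t0.25"]
sample_2 = ["mot_a\tmot_b\t3.0", "mot_c\tmot_d\t0.5"]

pairs_1 = parse_dataset(sample_1)
pairs_2 = parse_dataset(sample_2)
print(check_alignment(pairs_1, pairs_2, size_sim_scale=4))  # prints []
```

To check real files, an open file object can be passed to parse_dataset in place of the sample lists, since it only iterates over lines. Opening the files with encoding="utf-8" is an assumption about how the datasets are stored.
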
=================================================================================================================================
REFERENCE PAPER
=================================================================================================================================

When using these resources, please refer to the following paper (included in the package as "ACL_Framework_Word_Similarity.pdf"):

José Camacho-Collados, Mohammad Taher Pilehvar and Roberto Navigli.
A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets.
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015), Beijing, China, July 27-29, 2015.

=================================================================================================================================
CONTACT
=================================================================================================================================

If you have any enquiries about any of the resources, please contact José Camacho Collados (collados [at] di.uniroma1 [dot] it) or Mohammad Taher Pilehvar (pilehvar [at] di.uniroma1 [dot] it).

=================================================================================================================================

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License