=================================================================================================================================
						A FRAMEWORK FOR THE CONSTRUCTION OF 
					MONOLINGUAL AND CROSS-LINGUAL WORD SIMILARITY DATASETS

				José Camacho Collados, Mohammad Taher Pilehvar and Roberto Navigli
=================================================================================================================================


This package contains two directories (Monolingual and Cross-lingual):


* Monolingual

	- Datasets: It contains two monolingual word similarity datasets:
		- "rg65_spanish.txt": Spanish version of RG-65
		- "rg65_farsi.txt": Farsi (Persian) version of RG-65

	- Guidelines: It contains the guidelines used for the construction of the monolingual datasets:
		- "Similarity_Guidelines_ES.pdf": Annotation guidelines used for creating the Spanish word similarity dataset. 
		- "Similarity_Guidelines_FA.pdf": Annotation guidelines used for creating the Farsi word similarity dataset.

* Cross-lingual

	- Datasets: It contains fifteen cross-lingual word similarity datasets (15 different language pairs) based on RG-65, 
		    covering English (EN), French (FR), German (DE), Spanish (ES), Portuguese (PT), and Farsi (FA).

	- "cross-lingual_dataset_creation.py": the Python script for the automatic creation of cross-lingual datasets. 
						See the instructions below on how to use it.


=================================================================================================================================
FORMAT OF THE DATASETS
=================================================================================================================================

Each line in these seventeen datasets (Spanish and Farsi monolingual datasets and fifteen cross-lingual datasets) is formatted 
as follows:

word1<tab>word2<tab>similarity_score
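
As an illustration of this format, the following is a minimal Python sketch of a reader for such files. The function name 
"read_similarity_dataset" and the sample rows are hypothetical (they are not part of this package or taken from its datasets); 
the sketch only assumes the tab-separated, three-column layout described above.

```python
import io


def read_similarity_dataset(lines):
    """Parse lines of the form word1<tab>word2<tab>similarity_score.

    Hypothetical helper, not part of the package. Returns a list of
    (word1, word2, score) tuples, with the score converted to float.
    """
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        word1, word2, score = line.split("\t")
        rows.append((word1, word2, float(score)))
    return rows


# Made-up sample content, used only to show the expected layout.
sample = io.StringIO("word_a\tword_b\t3.5\nword_c\tword_d\t0.25\n")
rows = read_similarity_dataset(sample)
```

For a real file, the same function can be given an open file object, for example 
open("rg65_spanish.txt", encoding="utf-8"); the UTF-8 encoding is an assumption.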

=================================================================================================================================
INSTRUCTIONS FOR AUTOMATICALLY CREATING CROSS-LINGUAL DATASETS
=================================================================================================================================

Input: Two monolingual datasets previously aligned pair-wise (line by line), following the format indicated above.
Output: The cross-lingual dataset in the same format, saved in the same directory.
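
Since the input datasets must be aligned line by line, it may help to check this before running the script. The following 
Python sketch pairs up the rows of two already-parsed datasets and fails if their lengths differ. The function name 
"align_rows" and the sample rows are hypothetical; this is a sanity check only and does not reproduce what 
"cross-lingual_dataset_creation.py" does internally.

```python
def align_rows(rows_1, rows_2):
    """Pair the i-th row of rows_1 with the i-th row of rows_2.

    Hypothetical helper, not part of the package. Each row is assumed
    to be a (word1, word2, score) tuple. Raises ValueError when the
    two datasets do not have the same number of rows, since they
    cannot then be aligned line by line.
    """
    if len(rows_1) != len(rows_2):
        raise ValueError(
            "datasets are not aligned: {} rows vs {} rows".format(
                len(rows_1), len(rows_2)
            )
        )
    return list(zip(rows_1, rows_2))


# Made-up rows standing in for two aligned monolingual datasets.
first = [("a1", "b1", 3.0), ("c1", "d1", 1.0)]
second = [("a2", "b2", 3.5), ("c2", "d2", 0.5)]
aligned = align_rows(first, second)
```

An equal number of rows is necessary but not sufficient: the check cannot tell whether the i-th pairs are actually 
translations of each other.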

Instructions to run the Python script "cross-lingual_dataset_creation.py": 

The script takes the following parameters: path, file_1, file_2, and size_sim_scale
	path		: Path of the directory containing the monolingual datasets, which by default is also the path 
			  where the cross-lingual dataset will be created.
	file_1		: File name of the first dataset.
	file_2		: File name of the second dataset.
	size_sim_scale	: Size of the similarity scale (in RG-65, for example, the size of the similarity scale is 4).

Run it in the terminal by typing the following command: 
	$ python cross-lingual_dataset_creation.py path file_1 file_2 size_sim_scale

It will create the new cross-lingual dataset in the same directory (path).

Example of usage: 
	$ python cross-lingual_dataset_creation.py /home/Monolingual/Datasets/ rg65_spanish.txt rg65_farsi.txt 4

This creates a cross-lingual dataset in "/home/Monolingual/Datasets/" with the name "cross_rg65_spanish_rg65_farsi.txt".
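
The output name in this example appears to combine "cross_" with the two input file names without their extensions. The 
Python sketch below encodes that pattern. It is inferred from this single example only; the function name 
"expected_output_name" is hypothetical, and the script itself may name files differently in other cases.

```python
import os


def expected_output_name(file_1, file_2):
    """Guess the output file name from the two input file names.

    Hypothetical helper. The pattern is an assumption based on the
    single usage example in this README, not on the script's code.
    """
    stem_1 = os.path.splitext(file_1)[0]
    stem_2 = os.path.splitext(file_2)[0]
    return "cross_{}_{}.txt".format(stem_1, stem_2)


name = expected_output_name("rg65_spanish.txt", "rg65_farsi.txt")
```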


=================================================================================================================================
REFERENCE PAPER
=================================================================================================================================

When using these resources, please refer to the following paper (included in the package as "ACL_Framework_Word_Similarity.pdf"):

	José Camacho-Collados, Mohammad Taher Pilehvar and Roberto Navigli. 
	A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets. 
	In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015), 
	Beijing, China, July 27-29, 2015. 


=================================================================================================================================
CONTACT
=================================================================================================================================
 
If you have any enquiries about any of the resources, please contact José Camacho Collados (collados [at] di.uniroma1 [dot] it) 
or Mohammad Taher Pilehvar (pilehvar [at] di.uniroma1 [dot] it).

=================================================================================================================================

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License