====================================================================================================
  EMBEDDING WORDS AND SENSES TOGETHER VIA JOINT KNOWLEDGE-ENHANCED TRAINING
  Massimiliano Mancini*, Jose Camacho-Collados*, Ignacio Iacobacci and Roberto Navigli
====================================================================================================

This package contains the implementation of SW2V. In particular, it is a modification of the
word2vec software by Mikolov et al. (2013) that allows the joint training of word and sense/synset
embeddings.

Website: http://lcl.uniroma1.it/sw2v

====================================================================================================
  INPUT CORPUS
====================================================================================================

The program takes a partially or fully disambiguated and tokenized corpus as input (the word-sense
connectivity algorithm will be integrated at a later stage). Each word is separated from its
associated synset(s), if any, by the separator "%&$", as in the example below:

" Facilities_Management%&$bn:03245147n continues to investigate%&$bn:00087655v temperature%&$bn:00076457n concerns%&$bn:00021556n%&$bn:00016013n on a case-by-case basis . "

A small preprocessed sample corpus is included in the package (sample_preprocessed_corpus.txt). It
is provided only to test the code and to illustrate the input format; it is not large enough to
obtain meaningful representations. The preprocessed Wikipedia corpus can be downloaded from the
website (wikipedia_preprocessed_SW2V.txt).

The synsets may come from any semantic network or sense inventory, provided that their ids are 12
characters long (the code will be further extended to support ids of any length).
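
The short C sketch below (a hypothetical helper, not part of sw2v.c) shows how a corpus token in
this format can be split into its surface word and its associated synset/sense ids by searching for
the literal "%&$" separator:

/* Minimal sketch (hypothetical helper, not part of sw2v.c): split a corpus
   token such as "concerns%&$bn:00021556n%&$bn:00016013n" into its surface
   word and the associated synset/sense ids using the literal "%&$" separator. */
#include <stdio.h>
#include <string.h>

static const char SEP[] = "%&$";

static void split_token(const char *token) {
  char buf[1024];
  strncpy(buf, token, sizeof(buf) - 1);
  buf[sizeof(buf) - 1] = '\0';

  char *cur = buf;
  char *sep = strstr(cur, SEP);          /* find the first separator */
  if (sep != NULL) *sep = '\0';
  printf("word: %s\n", cur);             /* the part before the separator is the word */

  while (sep != NULL) {                  /* each remaining part is a synset/sense id */
    cur = sep + strlen(SEP);
    sep = strstr(cur, SEP);
    if (sep != NULL) *sep = '\0';
    printf("  id: %s\n", cur);
  }
}

int main(void) {
  split_token("concerns%&$bn:00021556n%&$bn:00016013n");
  split_token("on");                     /* token with no associated synset */
  return 0;
}

Compiling and running this sketch prints the word "concerns" followed by its two synset ids, and
then the word "on" with no ids.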
====================================================================================================
  USAGE
====================================================================================================

To compile, type the following command in the terminal:

gcc sw2v.c -o sw2v -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result

Then, type the following command to run our model (sample run with the configuration used in the
experiments of the paper):

./sw2v -train sample_preprocessed_corpus.txt -output sense_embeddings.bin -cbow 1 -size 300 -window 8 -negative 0 -hs 1 -threads 12 -binary 1 -iter 5 -update 0 -senses 1 -synsets_input 2 -synsets_target 1

====================================================================================================
  PARAMETERS
====================================================================================================

The parameters that can be specified are the ones of the original word2vec (see
https://code.google.com/p/word2vec/ for more information about them). This first implementation
includes both the CBOW and Skip-gram models, but it has only been tested with CBOW (-cbow 1).

SW2V adds 4 parameters (see the reference paper for more information about them):

Parameter        Possible values   Function
---------        ---------------   --------
synsets_input    0,1,2             Input layer configuration: only words (0), both words and
                                   senses/synsets (1), or only senses/synsets (2).
synsets_target   0,1,2             Output layer configuration: only words (0), both words and
                                   senses/synsets (1), or only senses/synsets (2).
senses           0,1               Whether to learn synset (0) or sense (1) embeddings.
                                   - A synset refers to a concept or an entity (e.g. bn:00005054n,
                                     referring to the apple fruit concept), while a sense
                                     additionally contains a lexicalization (e.g. apple_bn:00005054n).
update           0,1               Whether to update only the words that have a linked sense (0)
                                   or every word (1).
                                   - If update=0, word and sense embeddings appear closer in the
                                     space, but the word embeddings may be less accurate, as they
                                     receive fewer training instances.

An additional example configuration is sketched below, after the reference paper.

====================================================================================================
  REFERENCE PAPER
====================================================================================================

When using these resources, please refer to the following paper:

Massimiliano Mancini, Jose Camacho-Collados, Ignacio Iacobacci and Roberto Navigli.
Embedding Words and Senses Together via Joint Knowledge-Enhanced Training.
In Proceedings of CoNLL 2017, Vancouver, Canada.
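
As a further illustration of the four SW2V parameters, a hypothetical configuration (not the one
used in the experiments of the paper; the output file name is only illustrative) that learns synset
rather than sense embeddings, keeps both words and synsets in the input and output layers, and
updates every word would be:

./sw2v -train sample_preprocessed_corpus.txt -output synset_embeddings.bin -cbow 1 -size 300 -window 8 -negative 0 -hs 1 -threads 12 -binary 1 -iter 5 -update 1 -senses 0 -synsets_input 1 -synsets_target 1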
====================================================================================================
  WORD2VEC ORIGINAL README
====================================================================================================

Tools for computing distributed representations of words
---------------------------------------------------------

We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram (SG) models,
as well as several demo scripts.

Given a text corpus, the word2vec tool learns a vector for every word in the vocabulary using the
Continuous Bag-of-Words or the Skip-gram neural network architecture. The user should specify the
following:
 - desired vector dimensionality
 - the size of the context window for either the Skip-gram or the Continuous Bag-of-Words model
 - training algorithm: hierarchical softmax and / or negative sampling
 - threshold for downsampling the frequent words
 - number of threads to use
 - the format of the output word vector file (text or binary)

Usually, the other hyper-parameters, such as the learning rate, do not need to be tuned for
different training sets.

The script demo-word.sh downloads a small (100MB) text corpus from the web and trains a small word
vector model. After the training is finished, the user can interactively explore the similarity of
the words.

More information about the scripts is provided at https://code.google.com/p/word2vec/
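
To explore the trained vectors outside the provided demo scripts, a minimal standalone sketch is
given below. It is not part of the package; it assumes the standard word2vec binary format produced
with -binary 1 (a text header with vocabulary size and dimensionality, followed by each vocabulary
entry and its raw float vector) and computes the cosine similarity between two entries given on the
command line.

/* Minimal sketch (not part of the package): load a binary vector file written
   with -binary 1, assuming the standard word2vec format, and print the cosine
   similarity between two vocabulary entries. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#define MAX_W 200  /* maximum length of a vocabulary entry */

int main(int argc, char **argv) {
  if (argc < 4) {
    printf("usage: %s <vectors.bin> <entry1> <entry2>\n", argv[0]);
    return 1;
  }
  FILE *f = fopen(argv[1], "rb");
  if (f == NULL) { printf("cannot open %s\n", argv[1]); return 1; }

  long long words, size;
  fscanf(f, "%lld %lld", &words, &size);           /* header: vocab size and dimensionality */

  char *vocab = (char *)malloc(words * MAX_W);
  float *M = (float *)malloc(words * size * sizeof(float));
  for (long long b = 0; b < words; b++) {
    long long a = 0;
    char c;
    while (fscanf(f, "%c", &c) == 1 && c != ' ')   /* read the entry string up to the space */
      if (c != '\n' && a < MAX_W - 1) vocab[b * MAX_W + a++] = c;
    vocab[b * MAX_W + a] = '\0';
    fread(&M[b * size], sizeof(float), size, f);   /* raw float vector */
  }
  fclose(f);

  /* locate the two requested entries in the vocabulary */
  long long i1 = -1, i2 = -1;
  for (long long b = 0; b < words; b++) {
    if (!strcmp(&vocab[b * MAX_W], argv[2])) i1 = b;
    if (!strcmp(&vocab[b * MAX_W], argv[3])) i2 = b;
  }
  if (i1 < 0 || i2 < 0) { printf("entry not found\n"); return 1; }

  /* cosine similarity between the two vectors */
  double dot = 0, n1 = 0, n2 = 0;
  for (long long a = 0; a < size; a++) {
    dot += M[i1 * size + a] * M[i2 * size + a];
    n1  += M[i1 * size + a] * M[i1 * size + a];
    n2  += M[i2 * size + a] * M[i2 * size + a];
  }
  printf("cos(%s, %s) = %f\n", argv[2], argv[3], dot / (sqrt(n1) * sqrt(n2)));

  free(vocab); free(M);
  return 0;
}

For example, after training with the sample run shown above, one might compare a sense embedding
such as apple_bn:00005054n with a word embedding, provided both appear in the output vocabulary.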