====================================================================================================
  EMBEDDING WORDS AND SENSES TOGETHER VIA JOINT KNOWLEDGE-ENHANCED TRAINING
  Massimiliano Mancini*, Jose Camacho-Collados*, Ignacio Iacobacci and Roberto Navigli
====================================================================================================

This package contains the implementation of SW2V. In particular, it is a modification of the
word2vec software by Mikolov et al. (2013) that allows the joint training of word and sense/synset
embeddings.

Website: http://lcl.uniroma1.it/sw2v

====================================================================================================
  INPUT CORPUS
====================================================================================================

The program takes a partially or fully disambiguated and tokenized corpus as input (the word-sense
connectivity algorithm will be integrated at a later stage). Each word is separated from its
associated synset(s), if any, by the separator "%&$", as in the example below:

" Facilities_Management%&$bn:03245147n continues to investigate%&$bn:00087655v temperature%&$bn:00076457n concerns%&$bn:00021556n%&$bn:00016013n on a case-by-case basis . "

A small preprocessed sample corpus is included in the package (sample_preprocessed_corpus.txt). It
is provided only to test the code and to illustrate the input format; it is not large enough to
obtain meaningful representations. The preprocessed Wikipedia corpus can be downloaded from the
website (wikipedia_preprocessed_SW2V.txt).

The synsets may come from any semantic network or sense inventory, provided that their ids are 12
characters long (the code will be further extended to support ids of any length).
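
The short C sketch below (a hypothetical helper, not part of sw2v.c) shows how a corpus token in
this format can be split into its surface word and its associated synset/sense ids by searching for
the literal "%&$" separator:

/* Minimal sketch (hypothetical helper, not part of sw2v.c): split a corpus
   token such as "concerns%&$bn:00021556n%&$bn:00016013n" into its surface
   word and the associated synset/sense ids using the literal "%&$" separator. */
#include <stdio.h>
#include <string.h>

static const char SEP[] = "%&$";

static void split_token(const char *token) {
  char buf[1024];
  strncpy(buf, token, sizeof(buf) - 1);
  buf[sizeof(buf) - 1] = '\0';

  char *cur = buf;
  char *sep = strstr(cur, SEP);          /* find the first separator */
  if (sep != NULL) *sep = '\0';
  printf("word: %s\n", cur);             /* the part before the separator is the word */

  while (sep != NULL) {                  /* each remaining part is a synset/sense id */
    cur = sep + strlen(SEP);
    sep = strstr(cur, SEP);
    if (sep != NULL) *sep = '\0';
    printf("  id: %s\n", cur);
  }
}

int main(void) {
  split_token("concerns%&$bn:00021556n%&$bn:00016013n");
  split_token("on");                     /* token with no associated synset */
  return 0;
}

Compiling and running this sketch prints the word "concerns" followed by its two synset ids, and
then the word "on" with no ids.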
====================================================================================================
  USAGE
====================================================================================================

To compile, type the following command in the terminal:

gcc sw2v.c -o sw2v -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result

Then, type the following command to run our model (sample run with the configuration used in the
experiments of the paper):

./sw2v -train sample_preprocessed_corpus.txt -output sense_embeddings.bin -cbow 1 -size 300 -window 8 -negative 0 -hs 1 -threads 12 -binary 1 -iter 5 -update 0 -senses 1 -synsets_input 2 -synsets_target 1

====================================================================================================
  PARAMETERS
====================================================================================================

The parameters that can be specified are the ones of the original word2vec (see
https://code.google.com/p/word2vec/ for more information about them). This first implementation
includes both the CBOW and Skip-gram models, but it has only been tested with CBOW (-cbow 1).

SW2V adds 4 parameters (see the reference paper for more information about them):

Parameter        Possible values   Function
---------        ---------------   --------
synsets_input    0,1,2             Input layer configuration: only words (0), both words and
                                   senses/synsets (1), or only senses/synsets (2).
synsets_target   0,1,2             Output layer configuration: only words (0), both words and
                                   senses/synsets (1), or only senses/synsets (2).
senses           0,1               Whether to learn synset (0) or sense (1) embeddings.
                                   - A synset refers to a concept or an entity (e.g. bn:00005054n,
                                     referring to the apple fruit concept), while a sense
                                     additionally contains a lexicalization (e.g. apple_bn:00005054n).
update           0,1               Whether to update only the words that have a linked sense (0)
                                   or every word (1).
                                   - If update=0, word and sense embeddings appear closer in the
                                     space, but the word embeddings may be less accurate, as they
                                     receive fewer training instances.

An additional example configuration is sketched below, after the reference paper.

====================================================================================================
  REFERENCE PAPER
====================================================================================================

When using these resources, please refer to the following paper:

Massimiliano Mancini, Jose Camacho-Collados, Ignacio Iacobacci and Roberto Navigli.
Embedding Words and Senses Together via Joint Knowledge-Enhanced Training.
In Proceedings of CoNLL 2017, Vancouver, Canada.
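
As a further illustration of the four SW2V parameters, a hypothetical configuration (not the one
used in the experiments of the paper; the output file name is only illustrative) that learns synset
rather than sense embeddings, keeps both words and synsets in the input and output layers, and
updates every word would be:

./sw2v -train sample_preprocessed_corpus.txt -output synset_embeddings.bin -cbow 1 -size 300 -window 8 -negative 0 -hs 1 -threads 12 -binary 1 -iter 5 -update 1 -senses 0 -synsets_input 1 -synsets_target 1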
====================================================================================================
  WORD2VEC ORIGINAL README
====================================================================================================

Tools for computing distributed representations of words
---------------------------------------------------------

We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram (SG) models,
as well as several demo scripts.

Given a text corpus, the word2vec tool learns a vector for every word in the vocabulary using the
Continuous Bag-of-Words or the Skip-gram neural network architecture. The user should specify the
following:
 - desired vector dimensionality
 - the size of the context window for either the Skip-gram or the Continuous Bag-of-Words model
 - training algorithm: hierarchical softmax and / or negative sampling
 - threshold for downsampling the frequent words
 - number of threads to use
 - the format of the output word vector file (text or binary)

Usually, the other hyper-parameters, such as the learning rate, do not need to be tuned for
different training sets.

The script demo-word.sh downloads a small (100MB) text corpus from the web and trains a small word
vector model. After the training is finished, the user can interactively explore the similarity of
the words.

More information about the scripts is provided at https://code.google.com/p/word2vec/
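
To explore the trained vectors outside the provided demo scripts, a minimal standalone sketch is
given below. It is not part of the package; it assumes the standard word2vec binary format produced
with -binary 1 (a text header with vocabulary size and dimensionality, followed by each vocabulary
entry and its raw float vector) and computes the cosine similarity between two entries given on the
command line.

/* Minimal sketch (not part of the package): load a binary vector file written
   with -binary 1, assuming the standard word2vec format, and print the cosine
   similarity between two vocabulary entries. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#define MAX_W 200  /* maximum length of a vocabulary entry */

int main(int argc, char **argv) {
  if (argc < 4) {
    printf("usage: %s <vectors.bin> <entry1> <entry2>\n", argv[0]);
    return 1;
  }
  FILE *f = fopen(argv[1], "rb");
  if (f == NULL) { printf("cannot open %s\n", argv[1]); return 1; }

  long long words, size;
  fscanf(f, "%lld %lld", &words, &size);           /* header: vocab size and dimensionality */

  char *vocab = (char *)malloc(words * MAX_W);
  float *M = (float *)malloc(words * size * sizeof(float));
  for (long long b = 0; b < words; b++) {
    long long a = 0;
    char c;
    while (fscanf(f, "%c", &c) == 1 && c != ' ')   /* read the entry string up to the space */
      if (c != '\n' && a < MAX_W - 1) vocab[b * MAX_W + a++] = c;
    vocab[b * MAX_W + a] = '\0';
    fread(&M[b * size], sizeof(float), size, f);   /* raw float vector */
  }
  fclose(f);

  /* locate the two requested entries in the vocabulary */
  long long i1 = -1, i2 = -1;
  for (long long b = 0; b < words; b++) {
    if (!strcmp(&vocab[b * MAX_W], argv[2])) i1 = b;
    if (!strcmp(&vocab[b * MAX_W], argv[3])) i2 = b;
  }
  if (i1 < 0 || i2 < 0) { printf("entry not found\n"); return 1; }

  /* cosine similarity between the two vectors */
  double dot = 0, n1 = 0, n2 = 0;
  for (long long a = 0; a < size; a++) {
    dot += M[i1 * size + a] * M[i2 * size + a];
    n1  += M[i1 * size + a] * M[i1 * size + a];
    n2  += M[i2 * size + a] * M[i2 * size + a];
  }
  printf("cos(%s, %s) = %f\n", argv[2], argv[3], dot / (sqrt(n1) * sqrt(n2)));

  free(vocab); free(M);
  return 0;
}

For example, after training with the sample run shown above, one might compare a sense embedding
such as apple_bn:00005054n with a word embedding, provided both appear in the output vocabulary.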