A Large-Scale Multilingual Disambiguation of Glosses

This is a large-scale multilingual corpus of disambiguated glosses drawn from the different resources integrated in BabelNet, such as Wikipedia, Wiktionary, WordNet, OmegaWiki and Wikidata.
It provides almost 250 million sense annotations for both concepts and named entities. In total, over 35 million definitions have been disambiguated for 256 languages.
All definitions have been disambiguated automatically by making the most of their cross-language and cross-resource complementarities, using Babelfy, a state-of-the-art multilingual Word Sense Disambiguation and Entity Linking system, and the semantic vector representations of NASARI.


We release two versions of the corpus, both stored in easy-to-process XML files divided by language and resource. The first version (“full”) has been disambiguated for all content words and named entities, with an estimated precision above 75% for most languages. The second version (“refined”) has reduced coverage (around 65% for all content words and 75% for noun instances) but higher precision (estimated above 90%). More information is available in the README file. Additionally, we release the English annotations of the refined version in NIF format.
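Since the corpus ships as XML files split by language and resource, a quick way to get oriented before reading the README is to stream one file and count its element tags. The sketch below is only an illustration using Python's standard library: the function name `count_tags` is hypothetical, and the tag names in the in-memory sample are invented for the demo and are not the corpus schema, which is documented in the README.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Stream an XML file (path or file object) and count each element tag."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's children and text to limit memory use
    return counts


# Tiny in-memory stand-in for a corpus file. These tag names are made up
# for the demo and are NOT the real schema (see the README for that).
sample = io.BytesIO(
    b"<glosses><gloss><annotation/><annotation/></gloss><gloss/></glosses>"
)
tag_counts = count_tags(sample)
for tag, n in tag_counts.most_common():
    print(tag, n)
```

To try it on the actual data, pass the path of one of the downloaded XML files to `count_tags` instead of the sample. Streaming with `iterparse` avoids building the whole document tree in memory, which is likely to matter for the larger language files.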

Reference paper

When using these data, please refer to the following paper:

José Camacho-Collados, Claudio Delli Bovi, Alessandro Raganato and Roberto Navigli.
A Large-Scale Multilingual Disambiguation of Glosses. [paper] [bib] [slides]
In Proceedings of LREC 2016, Portorož, Slovenia, 23-28 May 2016, pp. 1701-1708.


Should you have any enquiries about any of the resources, please contact José Camacho-Collados (collados [at] di.uniroma1 [dot] it), Claudio Delli Bovi (dellibovi [at] di.uniroma1 [dot] it), Alessandro Raganato (raganato [at] di.uniroma1 [dot] it) or Roberto Navigli (navigli [at] di.uniroma1 [dot] it).

Last update: 29 Nov 2016