A Large-Scale Multilingual Disambiguation of Glosses

This is a large-scale multilingual corpus of disambiguated glosses drawn from several resources integrated in BabelNet, such as Wikipedia, Wiktionary, WordNet, OmegaWiki and Wikidata.
Almost 250 million sense annotations are provided for both concepts and named entities. In total, over 35 million definitions have been disambiguated for 256 languages.
All definitions have been automatically disambiguated by making the most of their cross-language and cross-resource complementarities, using Babelfy, a state-of-the-art multilingual Word Sense Disambiguation and Entity Linking system, together with the semantic vector representations of NASARI.

Downloads

We release two versions of the corpus, both stored in easy-to-process XML files divided by language and resource. The first version (“full”) is fully disambiguated for all content words and named entities, with an estimated precision above 75% for most languages. The second version (“refined”) has reduced coverage (around 65% for all content words and 75% for noun instances) but higher precision (estimated above 90%). More information is available in the README file.
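
As a rough illustration of how per-language XML files of this kind could be consumed, the sketch below streams a file with Python's standard library and counts the annotations attached to each definition. The element and attribute names used here (`definition`, `annotation`, `id`, `anchor`, `sense`) and the inline sample are hypothetical placeholders, not the actual schema of the corpus; the real format is described in the README file and the names should be adjusted accordingly.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# ASSUMPTION: these tag names are hypothetical placeholders.
# Replace them with the names documented in the corpus README.
DEFINITION_TAG = "definition"
ANNOTATION_TAG = "annotation"


def count_annotations(source):
    """Stream an XML file and count annotation elements per definition id."""
    counts = Counter()
    # iterparse avoids loading a multi-gigabyte file into memory at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == DEFINITION_TAG:
            def_id = elem.get("id", "<unknown>")
            counts[def_id] = len(elem.findall(f".//{ANNOTATION_TAG}"))
            elem.clear()  # drop the processed subtree to free memory
    return counts


# Tiny made-up sample standing in for one language/resource file.
SAMPLE = b"""<definitions>
  <definition id="d1">
    <text>a domesticated feline</text>
    <annotation anchor="feline" sense="hypothetical-sense-1"/>
  </definition>
  <definition id="d2">
    <text>a large body of water</text>
    <annotation anchor="body of water" sense="hypothetical-sense-2"/>
    <annotation anchor="water" sense="hypothetical-sense-3"/>
  </definition>
</definitions>"""

counts = count_annotations(io.BytesIO(SAMPLE))
print(dict(counts))
```

To process a downloaded file, pass its path instead of the in-memory sample, for example `count_annotations("path/to/file.xml")`, where the path is whatever location the archive was extracted to.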

Full version of the corpus. Download: [4.2GB]

Refined version of the corpus. Download: [3.5GB]

Additionally, we release the English annotations of the refined version in NIF format.

NIF version of the corpus. Download: [4.3GB]

Reference paper

When using these data, please cite the following paper:

José Camacho-Collados, Claudio Delli Bovi, Alessandro Raganato and Roberto Navigli.
A Large-Scale Multilingual Disambiguation of Glosses. [paper] [bib] [slides]
In Proceedings of LREC 2016, Portorož, Slovenia, 23-28 May 2016, pp. 1701-1708.

Contact

Should you have any enquiries about any of the resources, please contact José Camacho-Collados (collados [at] di.uniroma1 [dot] it), Claudio Delli Bovi (dellibovi [at] di.uniroma1 [dot] it), Alessandro Raganato (raganato [at] di.uniroma1 [dot] it) or Roberto Navigli (navigli [at] di.uniroma1 [dot] it).

Last update: 29 Nov 2016

A Large-Scale Multilingual Disambiguation of Glosses is an output of the MultiJEDI ERC Starting Grant No. 259234 and is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.