A large-scale multilingual corpus of disambiguated glosses drawn from the resources integrated in BabelNet, such as Wikipedia, Wiktionary, WordNet, OmegaWiki and Wikidata.
Almost 250 million sense annotations are provided, covering both concepts and named entities. In total, over 35 million definitions have been disambiguated for 256 languages.
All definitions have been automatically disambiguated using Babelfy, a state-of-the-art multilingual Word Sense Disambiguation and Entity Linking system, together with the semantic vector representations of NASARI, making the most of the definitions' cross-language and cross-resource complementarities.
Downloads
We release two versions of the corpus, both stored in easy-to-process XML files divided by language and resource. The first version (“full”) has been fully disambiguated for all content words and named entities, with an estimated precision above 75% for most languages. The second version (“refined”) has lower coverage (around 65% for all content words and 75% for noun instances) but higher precision (estimated above 90%).
Please find more information in the README file.
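Because the XML schema is documented in the README rather than on this page, a quick way to get oriented in a downloaded file is to count which element tags it contains. The following is a minimal Python sketch that deliberately assumes nothing about the corpus schema: the function name `count_tags` is hypothetical, and the tag names in the sample document are placeholders, not the actual element names used in the corpus.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Stream through an XML file (path or file object) and count each element tag."""
    counts = Counter()
    # iterparse reads the file incrementally, which matters for large corpus files.
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's text and children to keep memory use low
    return counts


# Made-up stand-in document; these tags are placeholders, NOT the corpus schema.
sample = io.BytesIO(
    b"<root><item><child/><child/></item><item><child/></item></root>"
)
for tag, n in count_tags(sample).most_common():
    print(tag, n)
```

To inspect a real file, pass its path to `count_tags` instead of the in-memory sample, then consult the README for the meaning of each element.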