A Large-Scale Multilingual Disambiguation of Glosses
José Camacho Collados, Claudio Delli Bovi, Alessandro Raganato, and Roberto Navigli
This package contains a multilingual corpus of disambiguated definitions. There are two versions of the corpus: "complete" and "high-precision".
We include definitions coming from seven different resources: "WordNet", "Open Multilingual WordNet", "Wiktionary", "Wikipedia", "Wikipedia
disambiguation pages", "OmegaWiki" and "WikiData".
256 languages are included, each one stored in an individual XML file starting with the respective ISO language code (two or three letters
representing the language).
The XML format of the disambiguated gloss files is described below. Please read the reference paper for a more detailed description
of the resource.
Each file contains a list of "definition" tags, each with its respective "id" (e.g. page titles in Wikipedia, offsets in WordNet) as an attribute.
Each definition is composed of plain text (the original definition as it appears in the given resource) and annotations.
The "annotation" tag refers to the sense annotation produced by our disambiguation scheme.
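The structure just described can be read with a standard XML parser. The following is a minimal sketch in Python, not the authors' own tooling. It makes several assumptions that this README does not confirm: the root tag name "definitions" in the sample, the BabelNet id being stored as the text of each "annotation" element, and the gloss text preceding the annotations inside each "definition". The id "bn:00000000n" and the sample gloss are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature file. The root tag name and the placement of the
# BabelNet id as element text are assumptions made for this sketch.
SAMPLE = """<definitions>
<definition id="example_id">
An example gloss.
<annotation anchor="gloss">bn:00000000n</annotation>
</definition>
</definitions>"""


def read_definitions(xml_text):
    """Return a list of (id, gloss text, annotations) tuples."""
    root = ET.fromstring(xml_text)
    result = []
    for definition in root.iter("definition"):
        # Text before the first child element; assumes the gloss is not
        # interleaved with the annotation elements.
        gloss = (definition.text or "").strip()
        annotations = []
        # iter() also finds annotations nested inside a wrapper element.
        for ann in definition.iter("annotation"):
            record = dict(ann.attrib)  # attributes are kept as strings
            record["babelnet_id"] = (ann.text or "").strip()
            annotations.append(record)
        result.append((definition.get("id"), gloss, annotations))
    return result


definitions = read_definitions(SAMPLE)
```

For the actual language files, which may be large, ET.parse(path) or ET.iterparse(path) can replace ET.fromstring.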
Each annotation includes its disambiguated BabelNet id and has four (or five) attributes:
- "source": it indicates whether the disambiguation has been performed by "BABELFY", the Most Common Sense ("MCS")
heuristic (only in the complete version of the corpus) or "NASARI" (only in the high-precision version of the corpus).
- "anchor": it corresponds to the exact surface form match found within the definition.
- "bfScore": it corresponds to the Babelfy score.
- "coherenceScore": it corresponds to the coherence score.
- "nasariScore": it corresponds to the NASARI score (only for the high-precision annotations).
- If the annotation source is MCS, "bfScore" and "coherenceScore" are not available and are both set to "--".
- In the high-precision version of the corpus, if "nasariScore" could not be computed (this happens for some "BABELFY" annotations), the score
is given as "--".
- If there are overlapping mentions in the high-precision version of the corpus (see the example below with "President", "President of the
United States", and "16th President of the United States"; and "American Civil War", "Civil War", and "War"), we recommend using the
longest mention, which is usually the most specific. In the example, we would keep "16th President of the United States" and
"American Civil War". For the same case in the complete version of the corpus, we recommend using the mention with the highest "coherenceScore".
Below are two example definitions, of "Palaeochiropteryx" and "Abraham Lincoln", taken from Wikipedia and WordNet respectively, as they appear in the English XML files.
The first definition is shown as in the complete version of the corpus and the second one as in the high-precision version of the corpus:
Palaeochiropteryx is an extinct genus of bat from the Middle Eocene of Europe.
16th President of the United States; saved the Union during the American Civil War and emancipated the slaves; was assassinated by Booth (1809-1865)
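The recommendation on overlapping mentions and the "--" placeholder for missing scores can be sketched in Python as follows, using the Abraham Lincoln definition above. This is one possible reading of the recommendation, not the authors' implementation. The function names are hypothetical, each annotation is assumed to be a dictionary of its attributes, and because the corpus provides anchors rather than character offsets, each anchor is assumed to refer to its first occurrence in the definition text.

```python
def parse_score(value):
    """Convert a score attribute to float; "--" or a missing value gives None."""
    if value is None or value == "--":
        return None
    return float(value)


def resolve_overlaps(text, annotations, high_precision=True):
    """Greedily keep non-overlapping mentions.

    Preference order: longest anchor (high-precision corpus) or highest
    coherenceScore (complete corpus).
    """
    def priority(ann):
        if high_precision:
            return len(ann["anchor"])
        score = parse_score(ann.get("coherenceScore"))
        # Annotations without a score (e.g. MCS) are considered last.
        return score if score is not None else float("-inf")

    kept, taken = [], []
    for ann in sorted(annotations, key=priority, reverse=True):
        # Assumption: the anchor refers to its first occurrence in the text.
        start = text.find(ann["anchor"])
        if start < 0:
            kept.append(ann)  # anchor could not be located; keep it as is
            continue
        end = start + len(ann["anchor"])
        # Keep the mention only if it overlaps no previously kept span.
        if all(end <= s or start >= e for s, e in taken):
            taken.append((start, end))
            kept.append(ann)
    return kept


text = ("16th President of the United States; saved the Union during the "
        "American Civil War and emancipated the slaves; was assassinated "
        "by Booth (1809-1865)")
anchors = ["President", "President of the United States",
           "16th President of the United States",
           "American Civil War", "Civil War", "War"]
kept = resolve_overlaps(text, [{"anchor": a} for a in anchors])
kept_anchors = [ann["anchor"] for ann in kept]
```

With the six overlapping anchors listed, kept_anchors contains only "16th President of the United States" and "American Civil War". Passing high_precision=False switches the preference to "coherenceScore", as recommended for the complete version of the corpus.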
For more information on our disambiguated corpus of glosses, and when using these resources, please refer to the following paper:
José Camacho-Collados, Claudio Delli Bovi, Alessandro Raganato and Roberto Navigli.
A Large-Scale Multilingual Disambiguation of Glosses.
In Proceedings of the 10th Language Resources and Evaluation Conference (LREC),
Portoroz, Slovenia, May 23-28, 2016
If you have any enquiries about any of the resources, please contact José Camacho Collados (collados [at] di.uniroma1 [dot] it),
Claudio Delli Bovi (dellibovi [at] di.uniroma1 [dot] it), Alessandro Raganato (raganato [at] di.uniroma1 [dot] it) or Roberto Navigli (navigli [at] di.uniroma1 [dot] it).
A Large-Scale Multilingual Disambiguation of Glosses is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.