=====================================================================================================================================================================
EuroSense: Automatic Harvesting of Multilingual Sense Annotations from Parallel Text
Claudio Delli Bovi, Jose Camacho Collados, Alessandro Raganato, and Roberto Navigli
=====================================================================================================================================================================

This package contains EuroSense, a multilingual sense-annotated resource automatically built via the joint disambiguation of the Europarl parallel corpus in 21 languages, with almost 123 million sense annotations for over 155 thousand distinct concepts and entities, drawn from the multilingual sense inventory of BabelNet.

There are two versions of the corpus: "high-coverage" and "high-precision", both stored in XML files (with UTF-8 encoding).

For a more detailed description of EuroSense please refer to Section 3 (Building EuroSense) of the reference paper. Find below more details about the XML format of EuroSense.

=====================================================================================================================================================================
FORMAT OF THE XML FILES
=====================================================================================================================================================================

Each version contains a list of "sentence" tags, with an incremental "id" (starting from "0") as attribute. Then, each sentence contains a list of "text" tags, corresponding to the tokenized texts of the sentence in a given language (the ISO code of the language is encoded in the "lang" attribute). Finally, the "annotations" tag includes all the sense annotations provided as a result of EuroSense's disambiguation scheme.
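
The layout described above can be read one sentence at a time with a streaming parser. What follows is a minimal sketch, not part of the EuroSense distribution. The toy input, the root element name "corpus", the child tag name "annotation", and every value in the sample are assumptions made for illustration; only the "sentence", "text" and "annotations" tags and the "id" and "lang" attributes come from the description above.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical toy input mimicking the described layout. The root name
# "corpus", the "annotation" child tag, and all values (including the
# placeholder BabelNet id) are assumptions, not data from the corpus.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<corpus>
  <sentence id="0">
    <text lang="en">Thank you , Mr President .</text>
    <text lang="es">Gracias , señor Presidente .</text>
    <annotations>
      <annotation lang="en" anchor="President" lemma="president" coherenceScore="0.5">bn:00000000n</annotation>
    </annotations>
  </sentence>
</corpus>
"""


def iter_sentences(source):
    """Yield (sentence_id, texts_by_lang, annotations) for each sentence."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "sentence":
            continue
        # Map each language code to its tokenized text.
        texts = {t.get("lang"): (t.text or "") for t in elem.iter("text")}
        # Keep every attribute as-is and store the element text as "synset".
        annotations = [
            dict(a.attrib, synset=(a.text or "").strip())
            for a in elem.iter("annotation")
        ]
        yield elem.get("id"), texts, annotations
        elem.clear()  # drop the processed subtree to limit memory use


sentences = list(iter_sentences(io.BytesIO(SAMPLE.encode("utf-8"))))
for sentence_id, texts, annotations in sentences:
    print(sentence_id, sorted(texts), [a["synset"] for a in annotations])
```

To read an actual file, pass its path (or a binary file object) to iter_sentences in place of the in-memory sample. Streaming is chosen here because the corpus holds millions of annotations, so loading a whole file into a single tree may not be practical.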
Each annotation includes its disambiguated BabelNet id and has four attributes in the high-coverage version, or six in the high-precision version:

- "lang": The language of the annotation (ISO code);
- "type"*: Whether the disambiguation was performed by "BABELFY" (first stage of the pipeline) or "NASARI" (second stage of the pipeline);
- "anchor": The exact surface form match found within the text;
- "lemma": The lemmatized form of the annotation's anchor;
- "coherenceScore": The coherence score (cf. Section 3.1 of the reference paper);
- "nasariScore"*: The NASARI score (cf. Section 3.2 of the reference paper).

[ * available only in the high-precision version of the corpus ]

Notes:

- In the high-precision version of the corpus, "nasariScore" is always set to "--" when the annotation has type "BABELFY";
- If there are overlapping mentions in the high-precision version of the corpus (for example "European", "Commission" and "European Commission"), we recommend using the longest mention ("European Commission" in this case), which is usually the most specific. For the same case in the high-coverage version, we recommend using the mention with the highest "coherenceScore".

See below an XML excerpt of the high-precision version of EuroSense:

    Es wurde schließlich von einer Neufestlegung eines institutionellen politischen Projekts gesprochen , und auch Präsident Santer sprach davon .
    Finally , Mr President , Mr Santer among others spoke of taking a fresh look at institutional policy .
    Endelig talte vi om , hr. formand - og det gjorde hr . Santer også - en ny definition af et politisk institutionsprojekt .
    Τέλος , κύριε Πρόεδρε , έγινε λόγος - το έκανε και ο κύριος Santer - για τον επαναπροσδιορισμού ενός θεσμικού πολιτικού σχεδίου .
    ...
    Por último , señor Presidente , se ha hablado -lo ha hecho también el Sr. Santer- de la redefinición de un proyecto político institucional .
    bn:00049573n
    bn:00090943v
    bn:00064234n
    bn:00090943v
    bn:00055346n
    bn:00064234n
    bn:00055346n
    bn:00090943v
    bn:00063330n
    bn:00001533n
    bn:00017517n
    bn:00055346n
    ...
    bn:00055346n
    bn:00064234n
    bn:00090943v
    bn:00082285v
    bn:00055346n
    bn:00049573n

=====================================================================================================================================================================
REFERENCE PAPER
=====================================================================================================================================================================

When using EuroSense, please refer to the following paper:

Claudio Delli Bovi, Jose Camacho-Collados, Alessandro Raganato and Roberto Navigli. EuroSense: Automatic Harvesting of Multilingual Sense Annotations from Parallel Text. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, Canada, 30 July-4 August 2017.

=====================================================================================================================================================================
CONTACT
=====================================================================================================================================================================

If you have any enquiries about any of the resources, please contact:

- Claudio Delli Bovi (dellibovi [at] di.uniroma1 [dot] it)
- Jose Camacho Collados (collados [at] di.uniroma1 [dot] it)
- Alessandro Raganato (raganato [at] di.uniroma1 [dot] it)
- Roberto Navigli (navigli [at] di.uniroma1 [dot] it).

=====================================================================================================================================================================
LICENSE
=====================================================================================================================================================================

EuroSense is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.