==============================================================================================================================================
SEW: Automatic Construction and Evaluation of a Large Semantically Enriched Wikipedia
Alessandro Raganato, Claudio Delli Bovi and Roberto Navigli
==============================================================================================================================================

This package contains SEW, a sense-annotated corpus automatically built from Wikipedia.

Two versions of SEW are available:

- "complete": The complete version comprises the annotations gathered from all the heuristics (including those with overlapping mentions).
- "conservative": The conservative version includes only one sense annotation per tagged mention, with no overlaps.

For more information, please refer to Section 3 (Building a Semantically Enriched Wikipedia) of the reference paper.

The sense inventory of SEW is BabelNet (http://babelnet.org), the largest multilingual encyclopedic dictionary and semantic network.

NOTE:
- Each Wikipedia article is stored in an individual XML file named after the corresponding page title.
- Text enclosed within tags has been escaped using XML entities. To retrieve the actual human-readable characters, we strongly recommend unescaping it (e.g. in Java, you can use the StringEscapeUtils class provided by the Apache Commons Lang API).

Please find below more details on the XML format of the sense-annotated files:

==============================================================================================================================================
FORMAT OF THE XML FILES
==============================================================================================================================================

Each file contains a "wikiArticle" tag, with the attributes "language" (ISO code of the language) and "title" (page title in Wikipedia).
The wikiArticle tag contains two main tags:

- "text": The plain text of the Wikipedia article;
- "annotations": A list of "annotation" tags.

An individual "annotation" tag refers to a sense annotation provided by SEW. Each annotation includes the following attributes:

- "babelNetID": The unique sense identifier as provided by the sense inventory (BabelNet);
- "mention": The surface form of the mention as it appears in the plain text of the article;
- "anchorStart": The token-based starting index (inclusive) of the annotation;
- "anchorEnd": The token-based ending index (exclusive) of the annotation;
- "type": The sense annotation type symbol (see Table 1 of the reference paper).

There are 8 different annotation types:

- HL: Original Hyperlink
- SP: Surface Mention Propagation
- LP: Lemmatized Mention Propagation
- PP: Person Mention Propagation
- WIL: Wikipedia Inlink Propagation
- BIL: BabelNet Inlink Propagation
- CP: Category Propagation
- MP: Monosemous Content Word

Please refer to the paper (Section 4: Propagation Heuristics) for a more detailed description.

A sample XML file can be found at: http://lcl.uniroma1.it/sew/sample.

==============================================================================================================================================
REFERENCE PAPER
==============================================================================================================================================

When using SEW, please refer to the following paper:

Alessandro Raganato, Claudio Delli Bovi and Roberto Navigli.
Automatic Construction and Evaluation of a Large Semantically Enriched Wikipedia.
Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI-16), New York City, New York, USA, 9-15 July 2016.
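
As a rough illustration of the XML format described above, the following Python sketch reads one article and lists its annotations. It is not part of the SEW distribution and rests on several assumptions: the miniature document (sentence, indices and BabelNet IDs) is invented for the example; each annotation field is read as an XML attribute, with a fallback to a child tag of the same name; and tokens are obtained by splitting the text on whitespace, which may differ from the tokenization actually used to compute anchorStart and anchorEnd. The helper names (field, read_article, unescape_raw) are hypothetical.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical miniature file: the sentence, the indices and the BabelNet IDs
# are made up for illustration and are not taken from SEW.
SAMPLE = """<wikiArticle language="en" title="Rome">
  <text>Rome is the capital of Italy &amp; its largest city</text>
  <annotations>
    <annotation babelNetID="bn:00000001n" mention="Rome" anchorStart="0" anchorEnd="1" type="HL"/>
    <annotation babelNetID="bn:00000002n" mention="Italy" anchorStart="5" anchorEnd="6" type="HL"/>
    <annotation babelNetID="bn:00000003n" mention="largest city" anchorStart="8" anchorEnd="10" type="SP"/>
  </annotations>
</wikiArticle>"""


def field(node, name):
    """Read a field as an XML attribute, falling back to a child tag.

    ASSUMPTION: the README calls these fields "attributes"; the fallback
    covers the case where a file stores them as child tags instead.
    """
    value = node.get(name)
    return value if value is not None else node.findtext(name)


def read_article(xml_string):
    """Parse one wikiArticle document into a plain dictionary."""
    root = ET.fromstring(xml_string)
    annotations = []
    for node in root.iter("annotation"):
        annotations.append({
            "babelNetID": field(node, "babelNetID"),
            "mention": field(node, "mention"),
            "start": int(field(node, "anchorStart")),  # inclusive
            "end": int(field(node, "anchorEnd")),      # exclusive
            "type": field(node, "type"),
        })
    return {
        "language": root.get("language"),
        "title": root.get("title"),
        # ElementTree already decodes standard XML entities such as &amp;
        "text": root.findtext("text") or "",
        "annotations": annotations,
    }


def unescape_raw(raw):
    """Unescape entities in a string that was read WITHOUT an XML parser."""
    return html.unescape(raw)


article = read_article(SAMPLE)

# ASSUMPTION: whitespace tokenization; the real token boundaries may differ.
tokens = article["text"].split()

# anchorStart is inclusive and anchorEnd exclusive, so a Python slice fits.
spans = [" ".join(tokens[a["start"]:a["end"]]) for a in article["annotations"]]

for ann, span in zip(article["annotations"], spans):
    print(ann["type"], ann["babelNetID"], repr(span))

# Keep only the original Wikipedia hyperlinks (type HL).
hyperlinks = [a for a in article["annotations"] if a["type"] == "HL"]
```

Because the ending index is exclusive, an annotation covers anchorEnd - anchorStart tokens. Comparing the reconstructed span with the "mention" attribute is a simple way to check whether the tokenization assumed here matches the one used in the corpus.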
==============================================================================================================================================
CONTACT
==============================================================================================================================================

For any enquiry related to SEW, please contact:

- Alessandro Raganato (raganato [at] di.uniroma1 [dot] it)
- Claudio Delli Bovi (dellibovi [at] di.uniroma1 [dot] it)
- Roberto Navigli (navigli [at] di.uniroma1 [dot] it)

==============================================================================================================================================
LICENSES
==============================================================================================================================================

Except for the original Wikilinks (HL), all the annotations in SEW are licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.

Original Wikilinks (HL) are licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC-BY-SA).